Preface


This book is addressed to students and research workers. It is concerned, 
not with proofs and deductions, but with the problems of fitting analyses 
to data, of translating the research worker’s questions into arithmetical 
operations, and of interpreting the results of the analyses. The equations 
and operations of each multivariate technique are worked through, 
starting with the raw data, and ending with statements of percentages, 
distances and angles. The purpose is to give the reader a complete grasp 
of the method, so that he can understand what is being done to the data, 
and ask himself whether the form of the analysis is appropriate to decid- 
ing questions in his own field of inquiry. 

The modern research worker will rarely need to calculate the principal 
components of a matrix by hand. But, though the work is done for him 
by a computer, he must understand the significance of the calculations 
to which the data are being subjected. The following chapters are in- 
tended to provide the necessary illumination in this and in other areas 
of multivariate analysis. 

The fact that the routine work is now usually done by machines means 
that this book can be simpler than accounts of multivariate methods 
which were written in the days before the computer. The older books 
had to find ways of avoiding decimals in order to make the arithmetic 
easier. But the cost of simplicity in the arithmetic was complexity in the 
algebra. The basic formulae for the methods of multivariate statistics 
are quite simple, and their simplicity can be maintained by a suitable 
choice of artificial data. In the following chapters the methods are illus- 
trated by means of arithmetical examples which can be reworked with 
pencil and paper in a matter of minutes. 

Although the examples all purport to deal with data obtained in the 
field of psychology, the economist, the sociologist, the biologist, and 
workers in other fields will have no difficulty in translating the scores of 
persons on tests into terms which are more familiar within their own 
disciplines.

In the past, multivariate techniques have sometimes been applied to 
data which cannot bear their weight. The research worker finds it easy 
to appreciate why an analysis of many variables should not, in general, 
be carried out on a small sample of persons. Only too often, however, 
he tries to persuade himself that the size of his sample compensates for 
its lack of randomness. He should keep constantly in mind the simple 
logical principle that results can be generalized from a sample to a
population only if the sample is randomly drawn from the population.
But there is another sense in which the research worker may misuse 
multivariate methods, and that is by applying them to measures which 
have low reliability and validity. A multivariate technique should not be 
employed unless there is good evidence that the variables are measuring 
the characteristics which they are supposed to measure, and that the 
measurement is accurate. 

The research worker often comes away from a consultation with a
statistician feeling that reciprocal understanding was not established.
While the research worker spoke of people or factories or social networks,
the statistician was visualizing clusters of points dispersed in a
space of several dimensions. When a statistician evolves a formula
which, to use his language, ‘maximizes a distance’, or ‘tests for heterogeneity
of dispersion’, he does not, in a sense, know the full implication
of his phrases. He has produced a rule or pattern, and it is for the research
worker to infer the consequences of imposing that pattern upon
data drawn from his own field. If the research worker is not aware of
the formal properties of the rule, he will not be able to arrive at an
intuitive grasp of the meaning of his analysis.

If statisticians find it convenient to think of their matrix operations in
geometrical terms, it may be inferred that the research worker will find it
informative to represent the conclusions of his analysis in a geometrical
picture. For this reason, emphasis has been laid on the actual calculation
of angles between lines and distances between points. Methods of projection
are worked out explicitly so that relations in the data can be
represented on a two- or three-dimensional surface. The advantage of a
pictorial summary of an analysis is that it can be made to present a great
many relations simultaneously.

Nothing will help the research worker to understand the statistician
so much as the ability to think of his data as points and lines in space.
A set of scores of a person on a number of variables, t say, may be
represented by a single point if each of the variables specifies a dimension
of a multidimensional space. This mode of representation is an
extension to t dimensions of the ordinary graph, in which a person is
plotted by reference to two dimensions. In the same way, the means of
several persons for t variables may be expressed as a single point, which
is the centre of gravity of the points of the persons from whom the means
are derived. If there are several groups of persons, then there will be
several mean points, one for each group. Within a group the person-points
will be scattered or dispersed about their mean point, and the
shape of the dispersion may be altered by stretching or contracting the
space in which the points lie. It is possible, for example, to change an
elliptical swarm of points in a two-dimensional space into a circular
swarm by contracting the space along the line of the major axis of the
ellipse. Whenever, in the following pages, an algebraic formula is given
a geometrical interpretation, the reader should exercise himself in
imagining the space of the analysis changing under the impact of the
calculations.

Methods of Multivariate Analysis is a self-contained exposition of the 
main points of the various techniques. The only difficult chapter is the 
first, which contains the basic mathematics necessary to an understand- 
ing of the methods described in the following chapters. If the reader fails 
in a determined assault on this chapter, he should retire and make a 
more gradual approach by means of the author’s Elementary Statistics: 
A Workbook. In the fifth chapter of that book he will find matrix 
methods explained at greater length, and he will be able to exercise 
himself on the simple problems which are given there, together with 
their solutions. 

The understanding of the geometry of matrix operations requires 
intelligent concentration. But there are many people with this capacity 
who are brought to a halt in the initial stages by an ignorance of the 
meaning of simple symbols and formulae. In the following pages the 
operations of statistical mathematics are introduced, illustrated, and 
explained at the same time as they are reduced to symbols. The names of 
Greek letters are supplied, so that the reader who is unfamiliar with the 
Greek alphabet has a handle by which to grasp each symbol. 

No derivations, deductions or proofs are attempted. When he employs
an accepted technique, such as a recognized test of significance, the
research worker may reasonably assume that the mathematical statisticians
have made no mistakes in their algebra, and that the argument
from the assumptions to the arithmetic operations is formally valid. But
an understanding of what it is that a particular technique does requires
a grasp of the structure of the arithmetic. The justification of a method
is something more than its deduction. In the case of some techniques,
the operations were historically prior to their mathematical derivation.
This was true, for example, of discriminant function analysis, which was
employed by psychologists before it had been elaborated mathematically.
Formal proof is, of course, a necessary condition of the validity of a
technique; but the fact that practical workers have found it necessary to
invent the technique is generally a sufficient guarantee of its usefulness.

The methods described in the following pages are well established,
with the exception of the interpretative technique discussed in the last
chapter. Statisticians may cavil at the practice of trying to assess the
relative importance of variables in a regression equation by calculating
coefficients of separate determination. The coefficients are presented
because they are a possible solution to the problem of the
interpretation of a multivariate analysis, a problem which lies at the
heart of the research worker’s interest in multivariate methods.

In any exposition of modern multivariate methods there are two 
strands, the one statistical and the other algebraic. The first is concerned
with calculating the probability that the value of some sample statistic
is representative of some population parameter. All the statistical techniques
described in the following chapters are parametric in that they
assume multivariate normality of distribution and homogeneity of variances
and covariances. The second strand, the mathematical, is concerned
with the reduction of ordered collections of numbers (matrices)
in forms according to certain rules. The statistical techniques are
typically concerned with estimating parameters, whereas the algebraic
techniques are typically concerned with maximizing a quantity. If he
keeps this distinction in mind, the reader will avoid many pitfalls in the
planning of his analyses.
Factor analysis has, in the past, been regarded by some statisticians as
a cuckoo in the multivariate nest; it is very obtrusive, but of doubtful
lineage. The introduction by Sir Cyril Burt of the simple summation
method, known in the United States as the centroid method, in the early
years of the century enabled psychologists and economists to factorize
extensive bodies of data and to establish the existence of certain patterns
which have recurred with some consistency in several studies. The arrival
of the computer released a spate of analyses, some nonsensical and many
valueless. Factor analysis attained complete statistical respectability
when Dr D. N. Lawley introduced methods of maximum likelihood into
the field. The writer has little doubt that the widespread adoption of
Lawley’s methods would considerably abate the flood of published
analyses. This book touches on factor analysis only in its discussion of
the method of principal components. Hand-computing techniques for
both the method of principal components and the simple summation
method have been described in Elementary Statistics: A Workbook. The
description of the method of principal components in the present book
is concerned, not with their calculation, but with their interpretation. An
understanding of the method is essential to an understanding of all other
areas of multivariate analysis. It is also useful in the interpretation of
results obtained in those areas.




Note on the Handbook of Multivariate Methods 
Programmed in Atlas Autocode 


The techniques expounded in Methods of Multivariate Analysis have been 
programmed by the author in the computer language known as Atlas 
Autocode. By arrangement with the Director and the Head of the 
Operations Group of the Atlas Computer Laboratory, Chilton, Berk- 
shire, the programmes, which are stored on magnetic tape, have been 
made available to research workers who have access to the appropriate 
facilities for tape preparation. The hard-backed edition of this book 
includes a Handbook which describes the programmes. The description 
of each programme consists of a simple example of an actual set of data, 
laid out in the format required by the computer, followed by the output 
which the computer actually gave when this set of data was fed in. In 
each case the data are those used to illustrate the corresponding tech- 
nique in Methods of Multivariate Analysis. 

The immediate purpose of this mode of presentation is to enable the 
research worker to fit his data to a pattern, to prepare his data tape, and 
to trace the steps by which the computer transforms the data into the 
output. In the longer term it is hoped that knowledge of multivariate 
analysis, combined with consciousness of the kinds of pattern to which 
data may be made to conform, will enable the research worker to plan 
his investigation in a full realization of the aims and requirements of the
various methods of multivariate analysis.


Chapter 1 Matrix Algebra 


Matrix algebra is an extension of ordinary algebra, which provides an 
efficient means of manipulating sets of linear equations and linear 
transformations. Most of the rules of ordinary algebra have analogues 
in matrix algebra. Matrix methods can, however, be employed without 
explicit reference to their origins, and this is how they are used in this 
book. 

A matrix is a rectangular array of numbers (or symbols standing for 
numbers), which are called its elements. A matrix may have any number 
of rows and any number of columns. If it has only one row, it is called 
a row vector. If it has only one column, it is called a column vector. In 
this book matrices other than vectors are represented by capital letters 
in bold type; vectors are represented by small letters in bold type. 

The multiplication of two matrices follows a ‘row by column’ rule; 
i.e. a row of the first matrix is multiplied by a column of the second. In 
the following example a 2 x 2 matrix A (i.e. a matrix with two rows 
and two columns) is multiplied by another 2 x 2 matrix B. 


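\[
\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},
\qquad
\mathbf{B} = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
\]

Product = C

\[
\begin{aligned}
c_{11} &= a_{11}b_{11} + a_{12}b_{21}, & c_{12} &= a_{11}b_{12} + a_{12}b_{22}, \\
c_{21} &= a_{21}b_{11} + a_{22}b_{21}, & c_{22} &= a_{21}b_{12} + a_{22}b_{22}.
\end{aligned}
\]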


It will be seen that the product of the ith row of A and the jth column
of B forms the element $c_{ij}$ of C.

In the following arithmetical example a 2 x 3 matrix E is multiplied
by a 3 x 2 matrix F:

\[
\mathbf{E} = \begin{pmatrix} 0 & 1 & 2 \\ 3 & 4 & 5 \end{pmatrix},
\qquad
\mathbf{F} = \begin{pmatrix} -1 & 2 \\ -4 & 0 \\ 4 & -2 \end{pmatrix}
\]

Product = G

\[
\begin{aligned}
g_{11} &= 0 - 4 + 8 = 4, & g_{12} &= 0 + 0 - 4 = -4, \\
g_{21} &= -3 - 16 + 20 = 1, & g_{22} &= 6 + 0 - 10 = -4.
\end{aligned}
\]


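In the next example a 3 x 2 matrix X is multiplied by a 2 x 1 matrix
y, that is, a column vector (a vector is a matrix with only one row or
one column of elements and is represented by a small letter).

\[
\mathbf{X} = \begin{pmatrix} -2 & 2 \\ 0 & 4 \\ 1 & 3 \end{pmatrix},
\qquad
\mathbf{y} = \begin{pmatrix} 2 \\ 4 \end{pmatrix}
\]

Product = z

\[
\begin{aligned}
z_1 &= -4 + 8 = 4 \\
z_2 &= 0 + 16 = 16 \\
z_3 &= 2 + 12 = 14
\end{aligned}
\]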


In general, a p x q matrix can be multiplied by an r x s matrix only
if q = r, that is, only if the first matrix has as many columns as the
second has rows. The product of the two matrices is a p x s matrix.
The three examples above illustrate this rule.
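The row by column rule and the conformability condition can be sketched in a few lines of Python. This is only an illustration, not part of the book's programmes: the function name `matmul` is an assumption, and the entries of E, F, X and y are those that can be read back from the arithmetic of the worked examples.

```python
def matmul(a, b):
    """Multiply a (p x q) by b (r x s) using the row by column rule."""
    if len(a[0]) != len(b):
        # The first matrix must have as many columns as the second has rows.
        raise ValueError("matrices are not conformable for multiplication")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

E = [[0, 1, 2], [3, 4, 5]]        # 2 x 3
F = [[-1, 2], [-4, 0], [4, -2]]   # 3 x 2
G = matmul(E, F)                  # 2 x 2

X = [[-2, 2], [0, 4], [1, 3]]     # 3 x 2
y = [[2], [4]]                    # 2 x 1 column vector
z = matmul(X, y)                  # 3 x 1 column vector

print(G)
print(z)
```

Calling `matmul(y, X)` raises an error, because a 2 x 1 matrix cannot be multiplied by a 3 x 2 matrix.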

In ordinary algebra ab = ba. This law does not hold in matrix 
algebra. AB is not, in general, equal to BA. Indeed, either or both of 
these products may not exist. It would not be possible to multiply the 
2 x 1 matrix y (above) by the 3 x 2 matrix X. For this reason we
distinguish premultiplication and postmultiplication. In the product
AB, B is premultiplied by A and A is postmultiplied by B.
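That AB and BA generally differ can be checked numerically. The sketch below uses two hypothetical 2 x 2 matrices chosen only for illustration; neither appears in the text.

```python
def matmul(a, b):
    """Row by column product of two conformable matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

A = [[1, 2], [3, 4]]   # hypothetical matrix
B = [[0, 1], [1, 0]]   # hypothetical matrix that swaps rows or columns

AB = matmul(A, B)      # B postmultiplies A: the columns of A are exchanged
BA = matmul(B, A)      # B premultiplies A: the rows of A are exchanged

print(AB, BA, AB == BA)
```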

The principal diagonal of a square matrix—that is, of a matrix with 
as many rows as columns—is the diagonal running from the top-left 
corner to the bottom-right corner. A matrix with ones in the principal 
diagonal and zeros everywhere else is called the unit matrix, and is 
represented by the letter I. A matrix remains unchanged when multi- 
plied by the unit matrix, just as, in ordinary algebra, a value remains 
unchanged when multiplied by unity. 


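\[
\mathbf{A}\,\mathbf{I} =
\begin{pmatrix} 10 & 4 & 6 \\ 8 & 4 & 3 \\ 14 & 16 & 2 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
=
\begin{pmatrix} 10 & 4 & 6 \\ 8 & 4 & 3 \\ 14 & 16 & 2 \end{pmatrix}
= \mathbf{A}
\]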


A matrix remains unaltered when it is postmultiplied or premultiplied 
by the unit matrix. 


AI=IA =A 


The transpose of a matrix is the matrix rewritten with rows as columns 
and hence with columns as rows. The transpose of matrix A is represented
by A’ (‘A prime’). For example:


M M’ 
Lees 3 
—1 4 Phe cick v3} 
—3 3 


Two matrices are added or subtracted by adding or subtracting corre- 
sponding elements. As an example of subtraction: 


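\[
\mathbf{H} - \mathbf{J} = \mathbf{K}:
\qquad
\begin{pmatrix} 1 & 2 & 3 \\ -1 & 4 & 2 \end{pmatrix}
-
\begin{pmatrix} 0 & 1 & 4 \\ -2 & 4 & 3 \end{pmatrix}
=
\begin{pmatrix} 1 & 1 & -1 \\ 1 & 0 & -1 \end{pmatrix}
\]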


Only matrices with the same number of rows and the same number of 
columns can be added or subtracted. 
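Transposition and subtraction are simple enough to sketch directly. The function names below are assumptions made for illustration, and the entries of H and J are those recoverable from the subtraction example.

```python
def transpose(m):
    """Rewrite the rows of m as columns."""
    return [list(column) for column in zip(*m)]

def subtract(a, b):
    """Subtract b from a element by element; both must have the same shape."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("matrices must have the same number of rows and columns")
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

H = [[1, 2, 3], [-1, 4, 2]]
J = [[0, 1, 4], [-2, 4, 3]]

K = subtract(H, J)
H_prime = transpose(H)   # a 2 x 3 matrix becomes a 3 x 2 matrix

print(K)
print(H_prime)
```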

In matrix algebra an ordinary single number is called a scalar and is 
represented by a Greek letter such as λ (lambda) or μ (mu). If a matrix
is multiplied by a scalar, every element of the matrix is multiplied by the 
scalar: 

A N = 1p 

2 TELE 4 
1A Se y) Sige 


2 


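A multiplication which often appears in statistical work is that of a
lambda (representing a latent root) by the unit matrix:

\[
\lambda \mathbf{I} = 35
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
=
\begin{pmatrix} 35 & 0 & 0 & 0 \\ 0 & 35 & 0 & 0 \\ 0 & 0 & 35 & 0 \\ 0 & 0 & 0 & 35 \end{pmatrix}
\]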


A square matrix A can be reduced to a simple form by calculating its 
determinant, which is a single number represented by |A|. If A is a 
2 x 2 matrix, the determinant is the product of the north-west and 
south-east elements, less the product of the other two elements: 


\[
\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix},
\qquad
|\mathbf{A}| = (1 \times 4) - (2 \times 3) = -2
\]
The reader who is familiar with methods of solving linear equations
may find it illuminating to write out the matrix equation Ax = z as a
set of linear equations:

\[
\begin{aligned}
a_{11}x_1 + a_{12}x_2 &= z_1 \\
a_{21}x_1 + a_{22}x_2 &= z_2
\end{aligned}
\]

In order to eliminate $x_2$ the first equation must be multiplied by $a_{22}$, the
second must be multiplied by $a_{12}$, and one equation must then be subtracted
from the other. The result may be written

\[
x_1 = (z_1 a_{22} - z_2 a_{12}) / (a_{11}a_{22} - a_{12}a_{21})
\]

If we go back to the original pair of equations and eliminate $x_1$ we
obtain

\[
x_2 = (z_2 a_{11} - z_1 a_{21}) / (a_{11}a_{22} - a_{12}a_{21})
\]

The denominator of each of these equations is the determinant of A, or
its eliminant, as it used to be called.
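The two elimination formulae can be tried out on a small system. The sketch below is illustrative only: the function name `solve_2x2` and the right-hand side z are assumptions, while A is the 2 x 2 matrix whose determinant was found to be -2. Exact fractions are used so that no rounding intrudes.

```python
from fractions import Fraction

def solve_2x2(a, z):
    """Solve Ax = z for a 2 x 2 matrix A by elimination."""
    determinant = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if determinant == 0:
        raise ValueError("the equations cannot be solved: the determinant is zero")
    x1 = Fraction(z[0] * a[1][1] - z[1] * a[0][1], determinant)
    x2 = Fraction(z[1] * a[0][0] - z[0] * a[1][0], determinant)
    return x1, x2

A = [[1, 2], [3, 4]]
z = [5, 6]            # hypothetical right-hand side

x1, x2 = solve_2x2(A, z)
print(x1, x2)

# Substituting back into the original equations recovers z.
check = [A[0][0] * x1 + A[0][1] * x2, A[1][0] * x1 + A[1][1] * x2]
print(check)
```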

If the order of a matrix is greater than 2 x 2, the determinant may 
be calculated by splitting it up according to certain rules until the re- 
sulting submatrices are of the order 2 x 2. We shall illustrate the 
method on a 3 x 3 matrix. The matrix is given the following chess- 
board pattern of signs starting with a plus in the north-west corner: 


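\[
\begin{pmatrix} + & - & + \\ - & + & - \\ + & - & + \end{pmatrix}
\]

Let us suppose that the elements of the matrix are as follows (for ease
of reference the element in the first row and the first column is given the
value eleven, and so on):

\[
\mathbf{X} = \begin{pmatrix} 11 & 12 & 13 \\ 21 & 22 & 23 \\ 31 & 32 & 33 \end{pmatrix}
\]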


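The element 11 is associated with the 2 x 2 submatrix which is made
up of the elements which are not in the same row and column as 11, and the
sign of the element is read in from the chess-board pattern:

\[
+11 \begin{vmatrix} 22 & 23 \\ 32 & 33 \end{vmatrix}
\]

This is calculated as

\[
+11[(22 \times 33) - (32 \times 23)] = -110
\]

Next the element 21 is associated with the 2 x 2 submatrix which is
made up of elements not in the same row and column:

\[
-21 \begin{vmatrix} 12 & 13 \\ 32 & 33 \end{vmatrix}
= -21[(12 \times 33) - (32 \times 13)] = +420
\]

Lastly the element 31 is associated with its submatrix:

\[
+31[(12 \times 23) - (22 \times 13)] = -310
\]

The determinant is the sum of these three values:

\[
|\mathbf{X}| = -110 + 420 - 310 = 0
\]

The chess-board sign attaches only to the leading element of each term,
and not to the elements of the submatrix. The determinant of this
particular matrix is zero because its third row is twice the second row
less the first.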


Another way of arriving at the value of the determinant is to calculate
it as the product of the latent roots of the matrix. The nature of latent
roots is explained in chapter 4.

If the determinant of a matrix is zero, the matrix is said to be singular.
A zero determinant occurs if at least one of the rows of the matrix can
be calculated from some weighted combination of the remaining rows:
in other words, if one of the equations is redundant. If a matrix contains
redundancy, then at least one of its latent roots is zero.
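The splitting of a determinant into 2 x 2 submatrices can be sketched as a short recursive routine that expands down the first column, attaching the chess-board sign to each leading element. The function names and the two matrices M and S are hypothetical and serve only as illustrations; S is built so that its third row is the sum of the first two, which makes it singular.

```python
def submatrix(m, row, col):
    """The matrix left when the given row and column of m are struck out."""
    return [r[:col] + r[col + 1:] for i, r in enumerate(m) if i != row]

def determinant(m):
    """Determinant by expansion down the first column."""
    if len(m) == 1:
        return m[0][0]
    if len(m) == 2:
        # North-west times south-east, less the product of the other two.
        return m[0][0] * m[1][1] - m[0][1] * m[1][0]
    # Signs down the first column of the chess-board run +, -, +, ...
    return sum(
        (-1) ** i * m[i][0] * determinant(submatrix(m, i, 0))
        for i in range(len(m))
    )

M = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]   # hypothetical non-singular matrix
S = [[1, 2, 3], [4, 5, 6], [5, 7, 9]]   # third row = first row + second row

print(determinant(M))
print(determinant(S))   # a redundant row gives a zero determinant
```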

The inverse (or reciprocal) of a square matrix is such that, when the
matrix is multiplied by its inverse, the product is the unit matrix. The
inverse of A is represented by $\mathbf{A}^{-1}$ just as, in ordinary algebra, 1/a can
be represented as $a^{-1}$.

\[
\mathbf{A}\,\mathbf{A}^{-1} = \mathbf{I}:
\qquad
\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\begin{pmatrix} -2 & 1 \\ 1\tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\]


The multiplication of a matrix and its inverse is one of the special cases
in which AB = BA, as in ordinary algebra, since

\[
\mathbf{A}\,\mathbf{A}^{-1} = \mathbf{I} = \mathbf{A}^{-1}\mathbf{A}
\]

No inverse can be calculated for a matrix whose determinant is zero,
that is, for a singular matrix.

The definition of the inverse is simple, but the arithmetic involved in
calculating it is often quite complicated. There is no really quick method
if the order of the matrix is greater than, say, 3 x 3. This is the sort
of calculation which is best left to a computer.

The rules of matrix algebra are much the same as those of ordinary
arithmetic, apart from the distinction between premultiplication and
postmultiplication mentioned above. Division is effected by premultiplying
or postmultiplying by an inverse. For example, if we wish to
divide T by S we calculate $\mathbf{S}^{-1}\mathbf{T}$.
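For a 2 x 2 matrix the inverse can be written down from the determinant, and 'division' then becomes a multiplication. The following sketch rests on assumptions: the function names are illustrative, S is the 2 x 2 matrix with determinant -2 used earlier, and T is a hypothetical matrix.

```python
from fractions import Fraction

def matmul(a, b):
    """Row by column product of two conformable matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix; a singular matrix has no inverse."""
    (a, b), (c, d) = m
    determinant = a * d - b * c
    if determinant == 0:
        raise ValueError("a singular matrix has no inverse")
    return [
        [Fraction(d, determinant), Fraction(-b, determinant)],
        [Fraction(-c, determinant), Fraction(a, determinant)],
    ]

S = [[1, 2], [3, 4]]
T = [[2, 0], [0, 2]]              # hypothetical matrix to be divided by S

S_inv = inverse_2x2(S)
identity_right = matmul(S, S_inv)   # S times its inverse
identity_left = matmul(S_inv, S)    # inverse times S
quotient = matmul(S_inv, T)         # T 'divided' by S, by premultiplication

print(S_inv)
print(quotient)
```

Premultiplying the quotient by S recovers T, which is the sense in which the operation is a division.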

Apart from vectors, most of the important matrices employed by
statisticians are square and symmetric. A square matrix is one with as
many rows as columns. A symmetric matrix is a square matrix in which
element $a_{ij}$ = element $a_{ji}$ for all i and j. An obvious example is the
variance-covariance matrix V, in which the covariance of test one and
test two is the same as the covariance of test two and test one, and
similarly for all pairs of tests. For example:

\[
\mathbf{V} =
\begin{array}{c|ccc}
 & t_1 & t_2 & t_3 \\ \hline
t_1 & 25 & 7 & 6 \\
t_2 & 7 & 4 & 2 \\
t_3 & 6 & 2 & 16
\end{array}
\]


The correlation matrix R can be derived quite simply from the
variance-covariance matrix by dividing each off-diagonal element by
the square root of the product (or the product of the square roots) of the
diagonal elements in the same row and the same column. For example,
the correlation of test one and test three is given by

\[
r_{13} = r_{31} = \frac{6}{\sqrt{25} \times \sqrt{16}} = 0.30
\]

\[
\mathbf{R} =
\begin{pmatrix}
(1.00) & 0.70 & 0.30 \\
 & (1.00) & 0.25 \\
 & & (1.00)
\end{pmatrix}
\]

The correlation of a test with itself is usually assumed to be perfect,
i.e. unity.
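The derivation of correlations from variances and covariances is easily sketched. The function name is an assumption, and the entries of V are those implied by the correlations quoted in the example (variances 25, 4 and 16, covariances 7, 6 and 2).

```python
import math

def correlation_matrix(v):
    """Divide each covariance by the product of the two standard deviations."""
    sd = [math.sqrt(v[i][i]) for i in range(len(v))]
    return [
        [v[i][j] / (sd[i] * sd[j]) for j in range(len(v))]
        for i in range(len(v))
    ]

V = [
    [25, 7, 6],
    [7, 4, 2],
    [6, 2, 16],
]

R = correlation_matrix(V)
for row in R:
    print([round(r, 2) for r in row])
```

The diagonal of the result is unity because each variance is divided by itself.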
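In geometrical terms a vector is a line in a given orientation
to certain axes.

[Figure: the vector OP drawn from the origin O, with its projection of 20 units on to one of the axes.]

In the figure the line OP is a vector which starts from the origin, and
it projects on to the axis at a point 20 units from the origin. In
arithmetic work with vectors it is customary to normalize them, that is,
to give them a sum of squares (SS) of unity. The vector with elements
15 and 20 has a length of 25, so that its normalized elements are

    15/25 = 0.6
    20/25 = 0.8

    0.6² + 0.8² = unity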
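Normalizing a vector amounts to dividing each element by the square root of the sum of squares. A minimal sketch follows; the function name is an assumption, and the elements 15 and 20 are taken from the arithmetic shown.

```python
import math

def normalize(vector):
    """Scale a vector so that the sum of squares of its elements is unity."""
    length = math.sqrt(sum(element ** 2 for element in vector))
    return [element / length for element in vector]

unit = normalize([15, 20])
sum_of_squares = sum(element ** 2 for element in unit)

print(unit, sum_of_squares)
```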


HOPE, K. Elementary Statistics: A Workbook. London: Pergamon.
SAWYER, W. W. [1966] Introducing Mathematics 4: A Path to Modern Mathematics. Harmondsworth, Middlesex: Penguin.
THURSTONE, L. L. [1947] Multiple-factor Analysis. Chicago: University of Chicago Press.
Chapter 2 Within and Between


Aristotle distinguished genera and differentiae. Roughly speaking, a 
genus is a class which has a characteristic which is common to all mem- 
bers of that class, and a differentia is a characteristic which belongs to 
members of one subgroup a but not to members of the other subgroups 
b, c, etc. (see Metaphysics, 1057-9). Aristotle usually treated the fixed, 
all-or-none predicates as qualitative attributes; statisticians make a dis- 
tinction between fixed attributes and quantitative variables. If a class 
is characterized by variable x (wealth, aggression, extraversion, etc.),
then it will usually be the case that some members have more of it than
others. We may suppose that all human beings, as distinct from other
animals, possess wealth; but some have a good deal more than others.
Subgroups within the class (for example, nations) can be distinguished
not by any absolute difference in amount of wealth but by the fact that,
on the average, some subgroups have more than others. Variability, or
variance, in amount of wealth can be calculated over all human beings,
ignoring their subgrouping; this yields an estimate of total variance. 
Variability between subgroups of human beings can be calculated; this 
yields an estimate of between groups variance. The remaining variance 
(total minus between) is an average of all the variances within subgroups. 

Now, in the case of wealth, it is obvious that the within group variance
is not a meaningful sort of average, since the gap between rich and poor
is much greater in some countries than in others. But suppose that we 
limit ourselves to a collection of countries which are more homogeneous 
in this respect. We might take a random sample of persons from each 
of the countries of western Europe, calculate the average wealth of each 
sample, and ask whether the differences among these averages reflect 
actual differences in the average wealth of the countries from which the 
samples were drawn. Univariate analysis of variance answers this ques- 
tion by dividing the between groups variance by the within group 
variance, and asking whether this ratio is so large that it is improbable 
that it is due to chance. If the variance ratio is large, we conclude that 
the differences among the sample averages cannot simply be the result 
of taking small samples of a quantity which varies from person to per- 
son. We then say we have evidence that the countries of western Europe 
differ in their average wealth per head of population.

Unless there is homogeneity of within group variance, analysis of
variance cannot be carried out. (The relations among (a) normality of
distribution, (b) homogeneity of variance, and (c) tests of differences 
between means are discussed by Box, 1953.) 


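The scores of the persons in two groups on a variable X are as follows:

                      Group 1    Group 2
    X                    1          4
                         3          6
                         2          6
                                    5
                                    5
                                    4
    Group mean           2          5        Overall mean 4

Is the variance homogeneous from group to group? Express each score as a
deviation from the group mean, and calculate the sum of squares (SS)
within each group.

                      Group 1    Group 2
    X - group mean      -1         -1
                        +1         +1
                         0         +1
                                    0
                                    0
                                   -1
    SS                   2          4        Within groups SS 6

We conclude that the variance is fairly constant from group to group,
and proceed to test whether the group means differ more than one
would expect by chance.

To do this, we replace each person’s score by the mean for the group to
which he belongs, expressed as a deviation from the overall mean. That
is, each score becomes $\bar{X}_g - \bar{X}_{.}$, which is (2 - 4) = -2 for persons
in group 1 and (5 - 4) = 1 for persons in group 2.

                      Group 1    Group 2
                        -2         +1
                        -2         +1
                        -2         +1
                                   +1
                                   +1
                                   +1
    SS                  12          6        Between groups SS 18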
Clearly the calculation of between groups SS may be simplified by
multiplying the square of $\bar{X}_g - \bar{X}_{.}$ by the number of persons in the
group, and summing over all groups.

                                         Group 1    Group 2
    $\bar{X}_g - \bar{X}_{.}$               -2         +1
    $n_g(\bar{X}_g - \bar{X}_{.})^2$        12          6        Between groups SS 18


The advantage of setting out the calculations in full is that it reveals the 
logic of analysis of variance. What we are doing is to analyse a person’s 
score into two parts, that part which is accounted for by the hypothesis 
of a difference between the means of the groups, and that part which is 
unrelated to the hypothesis. The score which is analysed into two parts 
is not a person’s raw score, but the deviation of his raw score from the 
overall mean. For the first person in the first group this deviation is 
$X - \bar{X}_{.} = 1 - 4 = -3$. This score is analysed into that part which
is accounted for by the hypothesis of a difference between the means of 
the two groups, viz. —2, and that part which is not related to the 
hypothesis, viz. —1. The two parts of the score add up to the deviation 
from the overall mean, viz. —3. 

The total SS is the SS of the deviation of each score from the overall
mean:

                                  Group 1    Group 2
    $X - \bar{X}_{.}$               -3          0
                                    -1         +2
                                    -2         +2
                                               +1
                                               +1
                                                0
    $\sum (X - \bar{X}_{.})^2$      14         10        Total SS 24


With g groups we have g - 1 between groups degrees of freedom,
and the degrees of freedom within groups is the sum of $n_g - 1$ over
all groups. We set out the results in a final table:

    Source     d.f.    SS    Mean square = Variance    F ratio
    Between     1      18           18/1               (18/1)/(6/7) = 21.00
    Within      7       6            6/7
    Total       8      24


If the variance ratio F is close to unity, then the difference between
groups is no more than would be accounted for by sampling error. We
use a table of F to determine how improbable such a conclusion is for
any obtained value of F.
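The whole univariate analysis can be retraced in a few lines. The sketch below is an illustration rather than the author's programme; the variable names are assumptions, and the scores are those of the worked example (three persons in group 1, six in group 2).

```python
# Scores on X for the two groups of the worked example.
groups = [[1, 3, 2], [4, 6, 6, 5, 5, 4]]

def mean(values):
    return sum(values) / len(values)

all_scores = [x for group in groups for x in group]
overall_mean = mean(all_scores)

# Between groups SS: n_g times the squared deviation of each group mean.
between_ss = sum(len(g) * (mean(g) - overall_mean) ** 2 for g in groups)
# Within groups SS: squared deviations of scores from their own group mean.
within_ss = sum((x - mean(g)) ** 2 for g in groups for x in g)
# Total SS: squared deviations of scores from the overall mean.
total_ss = sum((x - overall_mean) ** 2 for x in all_scores)

df_between = len(groups) - 1
df_within = sum(len(g) - 1 for g in groups)

# Variance ratio: between groups mean square over within group mean square.
f_ratio = (between_ss / df_between) / (within_ss / df_within)

print(between_ss, within_ss, total_ss, df_between, df_within, round(f_ratio, 2))
```

The between and within sums of squares add up to the total, which is the arithmetic form of the statement that each deviation is analysed into two parts.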

Multivariate analysis of variance is analogous to univariate analysis 
of variance. Sums of squares are calculated for each variable and sums 
of products are calculated for each pair of variables by an analogous 
formula. Suppose that we have values of a second variable for each 
person in the above example: 


                       Group 1        Group 2
                       X     Y        X     Y
                       1     1        4     6
                       3     2        6     8
                       2     3        6     8
                                      5    10
                                      5    10
                                      4     6
                                                     Overall means
    Group means
      $\bar{X}_g$         2              5           $\bar{X}_{.}$   4
      $\bar{Y}_g$         2              8           $\bar{Y}_{.}$   6


Now we wish to know whether the variability and the covariability 
in the two groups are sufficiently homogeneous for us to go on and test 
the variability and covariability of the means. Within each group we 
express the scores on a variable as deviations from the group mean for 
that variable, and we calculate sums of squares (SS) and sum of products 


(SP). 


                                              Group 1        Group 2
                                              X     Y        X     Y
                                             -1    -1       -1    -2
                                             +1     0       +1     0
                                              0    +1       +1     0
                                                             0    +2
                                                             0    +2
                                                            -1    -2

    $SS_x = \sum (X - \bar{X}_g)^2$              2              4
    $SS_y = \sum (Y - \bar{Y}_g)^2$              2             16
    $SP_{xy} = \sum (X - \bar{X}_g)(Y - \bar{Y}_g)$   +1       +4


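These may be written in matrix form as follows:

\[
\mathbf{W}_1 =
\begin{array}{c|cc}
 & X & Y \\ \hline
X & 2 & 1 \\
Y & 1 & 2
\end{array}
\qquad
\mathbf{W}_2 =
\begin{array}{c|cc}
 & X & Y \\ \hline
X & 4 & 4 \\
Y & 4 & 16
\end{array}
\]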


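These matrices are within group sums of squares and sums of products
matrices, and correspond to the within group SS in the univariate case.
To calculate the variance-covariance matrix of each group, which corresponds
to the within group mean square or variance, we divide $\mathbf{W}_g$ by
$n_g - 1$. In matrix terms this means that we multiply the matrix $\mathbf{W}_g$
by the scalar $1/(n_g - 1)$ to obtain the variance-covariance matrix $\mathbf{V}_g$.

\[
\mathbf{V}_1 = \tfrac{1}{2}\mathbf{W}_1 =
\begin{pmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 \end{pmatrix},
\qquad
\mathbf{V}_2 = \tfrac{1}{5}\mathbf{W}_2 =
\begin{pmatrix} \tfrac{4}{5} & \tfrac{4}{5} \\ \tfrac{4}{5} & \tfrac{16}{5} \end{pmatrix}
\]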
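The within group sums of squares and products, and the dispersion matrices derived from them, can be computed directly from the scores. This is a sketch under stated assumptions: the function names are illustrative, the scores are those of the two-variable example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# (X, Y) scores for each person in the two groups of the worked example.
group_1 = [(1, 1), (3, 2), (2, 3)]
group_2 = [(4, 6), (6, 8), (6, 8), (5, 10), (5, 10), (4, 6)]

def sums_of_squares_and_products(scores):
    """Within group SS and SP matrix W_g, from deviations about group means."""
    n = len(scores)
    mean_x = Fraction(sum(x for x, _ in scores), n)
    mean_y = Fraction(sum(y for _, y in scores), n)
    ss_x = sum((x - mean_x) ** 2 for x, _ in scores)
    ss_y = sum((y - mean_y) ** 2 for _, y in scores)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in scores)
    return [[ss_x, sp_xy], [sp_xy, ss_y]]

def dispersion(scores):
    """Variance-covariance matrix V_g, i.e. W_g multiplied by 1/(n_g - 1)."""
    w = sums_of_squares_and_products(scores)
    return [[element / (len(scores) - 1) for element in row] for row in w]

W1 = sums_of_squares_and_products(group_1)
W2 = sums_of_squares_and_products(group_2)
V1 = dispersion(group_1)
V2 = dispersion(group_2)

print(W1, W2)
print(V1, V2)
```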


Are these variance-covariance matrices (which are also known as 
dispersion matrices) sufficiently similar for us to carry on with the 
analysis? Each dispersion matrix is an estimate of the dispersion in the 
population of which the group is a random sample. In statistical terms 
what we are asking is whether the two populations from which the two 
groups are drawn are identical with respect to dispersions, that is, with 
respect to the variance of X, the variance of Y and the covariance of 
X and Y. Now statistics never claims to prove identity, just as science
in general cannot claim to prove a universal negative proposition. All 
that statistics can do is to look as hard as possible for a difference and, 
if no difference is found, permit us to assume identity until further 
evidence is adduced against it. So we set up the null hypothesis that the 
populations have identical dispersions, and we try to knock it down. 

In order to test the null hypothesis we need to know the matrix W,
which is the sum of the $\mathbf{W}_g$. W is the multivariate equivalent of the
within group sums of squares of the univariate analysis of variance.
Just as we calculate the within group mean square by dividing the sum
of squares by the degrees of freedom, which are $\sum (n_g - 1)$, so we calculate
the within group dispersion matrix V by dividing W by the within
group degrees of freedom.

\[
\mathbf{V} = \tfrac{1}{7}\mathbf{W} = \tfrac{1}{7}
\begin{pmatrix} 6 & 5 \\ 5 & 18 \end{pmatrix}
=
\begin{pmatrix} 0.857143 & 0.714286 \\ 0.714286 & 2.571429 \end{pmatrix}
\]


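The general formula for the test of significance of the homogeneity
of group dispersions is

\[
\chi^2 = -\log_e \prod_g \left\{ \frac{|\mathbf{V}_g|}{|\mathbf{V}|} \right\}^{(n_g - 1)}
\]

where $\prod$ denotes a product in the same way that $\sum$ denotes a sum:

\[
\begin{aligned}
\sum a_i &= a_1 + a_2 + \dots + a_k \\
\prod a_i &= a_1 \times a_2 \times \dots \times a_k
\end{aligned}
\]

In our example we have to calculate the product

\[
\left\{ \frac{|\mathbf{V}_1|}{|\mathbf{V}|} \right\}^{2}
\left\{ \frac{|\mathbf{V}_2|}{|\mathbf{V}|} \right\}^{5}
\]

The terms inside the brackets are determinants of matrices. For example, the
determinant of $\mathbf{V}_1$ is

\[
|\mathbf{V}_1| = (1 \times 1) - (\tfrac{1}{2} \times \tfrac{1}{2}) = 0.75
\]

The determinants which we require are

    $|\mathbf{V}_1|$ = 0.750000
    $|\mathbf{V}_2|$ = 1.920000
    $|\mathbf{V}|$  = 1.693878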
The logarithms required are logs to the base e rather than the
usual logs to the base 10, and any log to the base 10 may be turned into
a log to the base e by multiplying it by 2.30259. It will be recalled that
raising a term to the power k is equivalent to multiplying its log by k,


\[ \log a^k = k \log a \]


and multiplication of two terms a and b is effected by adding their 
logarithms, 


log ab = log a + log b 
So our chi square may be written 


\[
\chi^2 = -(2 \log_e 0.442771 + 5 \log_e 1.133494) = 1.0029
\]

The degrees of freedom are (t = 2 is the number of variables)

\[
(g - 1)t(t + 1)/2 = (2 - 1)2(2 + 1)/2 = 3
\]


The value of chi square is quite small and gives us no reason to 
suppose that the populations from which the two samples are drawn 
differ in their dispersions. We are reasonably secure in assuming that 
we may use V to represent a dispersion which is common to all the 
groups, and this assumption is made when we perform a multivariate 
analysis of variance. 
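
The arithmetic of the test can be retraced in a few lines. The following is a minimal sketch in Python, not part of the original working; it takes the three determinants and the multipliers 2 and 5 exactly as quoted above, and the variable names are merely illustrative. It also evaluates chi square by way of logs to the base 10 and the conversion factor 2.30259.

```python
import math

# Determinants of the dispersion matrices, as quoted in the text.
det_v = {1: 0.750000, 2: 1.920000}   # |V_1| and |V_2|
det_v_pooled = 1.693878              # |V|

# Within-group degrees of freedom (n_i - 1): the multipliers 2 and 5 in the text.
df_within = {1: 2, 2: 5}

# chi square = -sum over groups of (n_i - 1) * log_e(|V_i| / |V|)
chi_square = -sum(
    df_within[i] * math.log(det_v[i] / det_v_pooled) for i in det_v
)

# The same quantity from common logarithms, converted with the factor 2.30259.
chi_square_log10 = -2.30259 * sum(
    df_within[i] * math.log10(det_v[i] / det_v_pooled) for i in det_v
)

# Degrees of freedom: (g - 1) t (t + 1) / 2 with g groups and t variables.
g, t = 2, 2
degrees_of_freedom = (g - 1) * t * (t + 1) // 2

print(round(chi_square, 4), round(chi_square_log10, 4), degrees_of_freedom)
```

The two routes to chi square agree to within the rounding of the conversion factor.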

If there is only one variable, the test of homogeneity of dispersion 
reduces to the ordinary chi square test of homogeneity of variance. 

Although it might seem desirable to carry out the test of the 
homogeneity of dispersions whenever a multivariate analysis of variance 
is to be performed, in fact the test is often omitted because its 
interpretation presents difficulties. A non-significant result is fairly 
easy to interpret, particularly when the samples are large. But a 
significant chi square may mean either that dispersions are heterogeneous 
or that distributions are non-normal. The test of the homogeneity of 
dispersions is valid only if the distributions are multivariate-normal, 
and it is sensitive to departures from normality. 

Although the multivariate analysis of variance also assumes normality 
of distributions, it may be less sensitive to departures from normality 
than the test of homogeneity of dispersions. Furthermore, the multivariate 
analysis of variance can survive a certain amount of heterogeneity among 
the dispersions of the groups. For these reasons the preliminary test of 
homogeneity is often neglected. The test may, however, be made as a 
matter of routine in the course of a computer analysis, 
and the writer of a research report should warn his readers if chi square 
is large. 


Multivariate analysis of variance (analysis of dispersion) 


The means of a sample on several variables may be thought of as a 
single point in a space which has as many dimensions as there are 
variables. The difference between two such sample means may be tested 
by a multivariate equivalent of the t or F test. Just as F may be employed 
to test differences among the means of several groups, so the multi- 
variate analysis of variance may be employed to test the differences 
among the multivariate means of several groups. The multivariate 
analysis of variance is sometimes known as the analysis of dispersion; 
but the dispersion referred to is the total dispersion and not, as in the 
previous section, the dispersion within groups. 

In order to perform the analysis, we first of all require the within 
group sums of squares and sums of products matrix W. 


Now in calculating a variance ratio in the ordinary univariate analysis 
of variance we usually divide the between groups mean square by the 
within group mean square. In multivariate analysis the procedure is in 
principle the same, but in fact we calculate an L-criterion as the ratio 
of within to total sums of squares and sums of products. To calculate 
the total sums of squares and sums of products matrix T, we express 
every score as a deviation from the overall mean for that test and sum 
the squares and products. 


        Group 1          Group 2 
        X     Y          X     Y 
       -3    -5          0     0 
       -1    -4         +2    +2 
       -2    -3         +2    +2 
                        +1    +4 
                        +1    +4 
                         0     0 

SS   $\sum (X - \bar{X})^2$                  24 
SS   $\sum (Y - \bar{Y})^2$                  90 
SP   $\sum (X - \bar{X})(Y - \bar{Y})$       41 


As a check we can calculate the between groups sums of squares and 
sums of products matrix B, which should be equal to T − W. For this 
purpose we express each group mean as a deviation from the overall 
mean for that test: 

                                                        Group 1   Group 2 
Mean of X                                                 -2        +1 
Mean of Y                                                 -4        +2 
SS   $n_g(\bar{X}_g - \bar{X})^2$                          12         6 
SS   $n_g(\bar{Y}_g - \bar{Y})^2$                          48        24 
SP   $n_g(\bar{X}_g - \bar{X})(\bar{Y}_g - \bar{Y})$       24        12 

The between groups and within group sums of squares and sums of 
products matrices sum to the total sums of squares and sums of products 
matrix, which shows that the arithmetic is correct. We calculate 
L as the ratio of two determinants: 

$$L = \frac{|W|}{|T|}$$

L can vary between zero and unity. If L is unity, there is no 
between groups variance or covariance; this means that each group 
has the same mean score on a particular test as every other group. 




The degrees of freedom are 
$$t(g - 1) = 2(2 - 1) = 2$$


Thus we have evidence that the two groups are drawn from popula- 
tions which differ in their mean quantity of X, or in their mean quantity 
of Y, or in some compound of the two. To find out which variables 
contribute to the multivariate difference in means, a univariate analysis 
might be applied to each variable in turn. Alternatively, a discriminant 
function analysis (see chapter 6) could be carried out with the purpose 
of assessing the relative importance of each variable to the dimension 
which maximizes the difference between the two samples. 

In practice we should not, of course, accept the verdict of an approxi- 
mate test unless it was based on many more persons than the nine in 
our example. If our samples are rather small, we should employ an 
exact test, if one is available. Exact tests employing the F distribution 
are available if the number of variables is two, and/or if the number of 
groups is two or three (Kendall, 1957, p. 132; Rao, 1952, p. 260). The 
chi square approximation may be improved by replacing $(\sum n)$ by 
$[\sum n - 1 - \tfrac{1}{2}(t + g)]$ in the formula for chi square. This may be done 
as a matter of course, whatever the size of samples. A still better, but 
slightly more complicated, approximation may be achieved by employ- 
ing the F distribution (Rao, 1952, p. 261). When this approximation is 
applied to the simple cases for which an exact F is known, the exact and 
the approximate values of F are equal. 
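
The steps of the analysis of dispersion can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original working: the scores are hypothetical values chosen to agree with the totals quoted above (SS 24 and 90, SP 41), the helper names `sscp`, `mean`, `det2` and `add` are invented for the purpose, and the multiplier $[\sum n - 1 - \tfrac{1}{2}(t + g)]$ is assumed to be the improved approximation described in the preceding paragraph.

```python
import math

def sscp(rows, centre):
    """Sums of squares and products of (x, y) rows about a given centre."""
    cx, cy = centre
    sxx = sum((x - cx) ** 2 for x, _ in rows)
    syy = sum((y - cy) ** 2 for _, y in rows)
    sxy = sum((x - cx) * (y - cy) for x, y in rows)
    return [[sxx, sxy], [sxy, syy]]

def mean(rows):
    n = len(rows)
    return (sum(x for x, _ in rows) / n, sum(y for _, y in rows) / n)

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

# Hypothetical scores (deviations from the overall means), chosen to agree
# with the totals quoted in the text; they are an assumption of this sketch.
groups = [
    [(-3, -5), (-1, -4), (-2, -3)],
    [(0, 0), (2, 2), (2, 2), (1, 4), (1, 4), (0, 0)],
]

all_rows = [row for grp in groups for row in grp]
T = sscp(all_rows, mean(all_rows))        # total matrix
W = [[0, 0], [0, 0]]
for grp in groups:
    W = add(W, sscp(grp, mean(grp)))      # pooled within-group matrix

L = det2(W) / det2(T)                     # L-criterion, |W| / |T|

N, t, g = len(all_rows), 2, len(groups)
chi_square = -(N - 1 - (t + g) / 2) * math.log(L)   # assumed improved multiplier
df = t * (g - 1)

print(T, W, round(L, 4), round(chi_square, 2), df)
```

A small value of L means that most of the total dispersion lies between the groups, and a value near unity means that almost none of it does.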
Chapter 3 Factorial Designs 


One of the most powerful devices which the statistician possesses is 
factorial design as applied in the analysis of variance. The simplest type 
of factorial design is the 2 x 2 case. We might, for example, obtain 
the following univariate data in an experiment in which the two factors 
are sex and genetic constitution, and each factor occurs at two levels. 
The two levels for sex are, of course, male and female, and the two 
genetic levels are normal and abnormal. 


              Men                    Women 
        Normal  Abnormal       Normal  Abnormal 
           4       0              2       2 
           6       2              0       4 
Mean       5       1              1       3 


We wish to ask three questions: 

(a) Do men differ from women on this variable? 

(b) Do normal people differ from abnormal people? 

(c) Is there some complicated interaction between sex and genetic 
constitution such as that normal men and abnormal women 
on the one hand differ from abnormal men and normal women 
on the other? 

All three hypotheses concern differences among the group means, and 
we wish to test each hypothesis independently of the other two. In other 
words, the part of the between groups variance which is attributable to 
one hypothesis must be quite separate from the parts of the between 
groups variance which are attributable to the other hypotheses. This is 
feasible, provided that each group mean is based on the same weight of 
evidence as every other group mean, i.e. provided that there is the same 
number of persons in each group. Because the geometrical analogue of 
statistical independence is orthogonality (that is, being at right-angles), 
we refer to this method of analysis as the making of orthogonal designed 
comparisons in the analysis of variance. 

We construct three weighting vectors, one for each hypothesis (see 
the table below). These vectors have two necessary properties: in the 
first place, the elements of each vector sum to zero; and in the second 
place, the product of any pair of vectors (x'y, x'z or y'z) is zero. This 
second condition ensures that our comparisons are orthogonal to one 
another. 

Group                Mean      Weighting vectors 
                      m        x       y       z 
Men     Normal        5       0.5     0.5     0.5 
        Abnormal      1       0.5    -0.5    -0.5 
Women   Normal        1      -0.5     0.5    -0.5 
        Abnormal      3      -0.5    -0.5     0.5 

The sign 
pattern of the vectors is such that the first contrasts men with women, 
the second contrasts normal people with abnormal people, and the third 
contrasts the difference between normal and abnormal men with the 
difference between normal and abnormal women (or, to put it another 
way round, it contrasts the difference between normal men and women 
with the difference between abnormal men and women). The first two 
vectors are used to test the differences due to sex and genetic constitution 
respectively, and the third vector is used to test the difference between 
two differences. All three vectors are normalized; that is, the sum of 
squares of each is unity. (Normalization is simply a convention which 
ensures that all vectors shall have the same length. It does not affect the 
relations among the elements of a vector, nor does it affect the relations 


between vectors.) To calculate the SS due to sex we multiply the 
transpose of the vector of means by x: 

$$m'x = 1$$

We square this value and multiply it by n (the number of persons in 
each group): 

$$n(m'x)^2 = 2 \times 1^2 = 2$$

In a similar way we calculate the SS for constitution as 2 and the SS for 
the interaction of sex and constitution as 18. We check that they sum 
to the between groups SS = 22. We have now partitioned the between 
groups degrees of freedom (d.f.) and sum of squares (SS) as follows: 


Source                      d.f.     SS      MS 
Between groups                3      22     7.33 
  Main terms 
    Sex                       1       2      2 
    Constitution              1       2      2 
  Interaction term 
    Sex x constitution        1      18     18 
Within groups                 4       8      2 

Total                         7      30 
The degrees of freedom for a factor are obtained by subtracting one 
from the number of levels at which that factor occurs. For example, sex 
occurs at the two levels, male and female, and so it has one degree of 
freedom. The degrees of freedom for the interaction of two (or more) 
factors are obtained by multiplying the degrees of freedom of the inter- 
acting factors. 

In this example the variance (MS) attributable to the two factors* is 
no greater than would be expected by chance, but there is significant 
evidence of an interaction between sex and constitution ($F_{1,4}$ = 18/2 = 
9.00). 
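
The partition by weighting vectors can be checked with a short Python sketch. It uses only the means, weights and within groups mean square given above; the dictionary keys and the helper `dot` are illustrative names introduced here.

```python
# Cell means in the order: normal men, abnormal men, normal women, abnormal women.
means = [5, 1, 1, 3]
n_per_group = 2

weights = {
    "sex":          [0.5,  0.5, -0.5, -0.5],
    "constitution": [0.5, -0.5,  0.5, -0.5],
    "interaction":  [0.5, -0.5, -0.5,  0.5],
}

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Each vector sums to zero, has unit length, and is orthogonal to the others.
names = list(weights)
for name in names:
    assert sum(weights[name]) == 0
    assert dot(weights[name], weights[name]) == 1
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        assert dot(weights[names[i]], weights[names[j]]) == 0

# SS for a comparison = n * (m'x)^2
ss = {name: n_per_group * dot(means, w) ** 2 for name, w in weights.items()}

ms_within = 8 / 4      # within groups SS divided by its d.f., from the table
f_ratio = {name: ss[name] / ms_within for name in ss}

print(ss, f_ratio)
```

Because the three comparisons are orthogonal, their sums of squares add up to the between groups SS of 22.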

The foregoing method of weighting coefficients is mainly of value 

(a) when there are two or more factors, each of which occurs at 
only two levels, or 
(b) when the groups are ordered along some variable (perhaps 
time) and it is desired to test the linearity of the regression of 
the group means on this variable. 
Weighting vectors for these two cases are given by Snedecor (1956, 
chapter 12).† 

The more general case, when some or all factors have more than two 
levels, may be illustrated by a 3 x 2 x 2 analysis in which two stan- 
dardized intelligence tests have been administered to three persons on 
two occasions. The scores are expressed as deviations about the general 
mean of all twelve scores. These scores are not, strictly speaking, means, 
since each cell of the experiment contains only one observation. We do 
not, therefore, have an estimate of variance within cells. For purposes 
of illustration, however, the scores may be treated exactly as if they 


were means. 


                    t₁            t₂ 
                 o₁    o₂      o₁    o₂     Person means 
p₁                0    -4       2    -6         -2 
p₂                0    -2       2     0          0 
p₃                0     0       8     0         +2 
Test means          -1            +1 
Occasion means   o₁  +2 
                 o₂  -2 


* Since the two factors are interacting with one another, it does not make 
sense to ask whether either factor is significant, because, in the presence of 
interaction, the mean for a factor varies according to the level of the other 
factor. 

† It should be noted that Snedecor normalizes his weighting vector after 
multiplying it by the vector of sums. 
To calculate the sum of squares due to the factor of persons, we 
square the mean for each person and multiply it by the number of scores 
which contribute to it (4 in each case). Summing the results, we obtain 
32. Similarly, to calculate the sum of squares due to tests, we square 
each test mean and multiply it by the number of scores which contribute 
to it (6 in each case). Summing the results, we obtain 12. Applying the 
same principle to occasions, we obtain SS = 48. If we label the factors 
P for persons, T for tests and O for occasions, we now wish to calculate 
the three first-order interactions PT, PO and TO. We calculate these by 
removing from the original scores those parts which are accounted for 
by P, T and O. Thus, from the score of person 1 on test 1 on occasion 1 
(zero) we subtract the mean for p₁ (−2), the mean for t₁ (−1) and the 
mean for o₁ (+2): 

$$0 - (-2) - (-1) - (+2) = 1$$


Omitting the variance due to the main terms in this way leaves us with 
the following reduced scores: 


            t₁            t₂ 
         o₁    o₂      o₁    o₂ 
p₁        1     1       1    -3 
p₂       -1     1      -1     1 
p₃       -3     1       3    -1 


To calculate the SS for the person x test interaction, we collapse 
these scores by averaging across occasions: 


         t₁    t₂ 
p₁        1    -1 
p₂        0     0 
p₃       -1     1 


We square each element and multiply it by the number of scores which 
contribute to it (2), obtaining an SS of 8. 
To calculate the SS due to PO we collapse the same scores by averaging 
across tests: 

         o₁    o₂ 
p₁        1    -1 
p₂       -1     1 
p₃        0     0 


The SS is again 8. 
Collapsing now across persons, we obtain 

         o₁    o₂ 
t₁       -1     1 
t₂        1    -1 

The TO SS is 12. 
We are left with the possibility of a second-order interaction of all 
three factors. To calculate the SS attributable to PTO, we subtract the 
means in the first-order interaction matrices from the previous table of 
reduced scores. As an example: from the reduced score p₁t₁o₁ (1) we 
subtract p₁t₁ in PT (1), p₁o₁ in PO (1) and t₁o₁ in TO (−1), giving a 
new reduced score of zero. 

            t₁            t₂ 
         o₁    o₂      o₁    o₂ 
p₁        0     0       0     0 
p₂        1    -1      -1     1 
p₃       -1     1       1    -1 


To obtain the PTO sum of squares we sum the squares of all these scores 
and this gives an SS of 8. The completed analysis is thus 


Source    d.f.     SS     MS 
P           2      32     16 
T           1      12     12 
O           1      48     48 
PT          2       8      4 
PO          2       8      4 
TO          1      12     12 
PTO         2       8      4 

Total      11     128 
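
The whole partition can be retraced mechanically. The sketch below, in plain Python, follows the steps described above: main-effect means, reduced scores, collapsed interaction tables, and the second-order residual. It assumes, as in the example, that the scores are deviations about the general mean; the variable names are illustrative.

```python
from itertools import product

# scores[p][t][o]: three persons x two tests x two occasions (deviation scores).
scores = [
    [[0, -4], [2, -6]],
    [[0, -2], [2,  0]],
    [[0,  0], [8,  0]],
]
P, T, O = 3, 2, 2
cells = list(product(range(P), range(T), range(O)))

def avg(values):
    values = list(values)
    return sum(values) / len(values)

# Main-effect means (the general mean is zero, so none is subtracted).
p_mean = [avg(scores[p][t][o] for t in range(T) for o in range(O)) for p in range(P)]
t_mean = [avg(scores[p][t][o] for p in range(P) for o in range(O)) for t in range(T)]
o_mean = [avg(scores[p][t][o] for p in range(P) for t in range(T)) for o in range(O)]

# Reduced scores: remove the parts accounted for by P, T and O.
red = {(p, t, o): scores[p][t][o] - p_mean[p] - t_mean[t] - o_mean[o]
       for p, t, o in cells}

# First-order interactions: collapse the reduced scores over the third factor.
pt = {(p, t): avg(red[p, t, o] for o in range(O)) for p in range(P) for t in range(T)}
po = {(p, o): avg(red[p, t, o] for t in range(T)) for p in range(P) for o in range(O)}
to = {(t, o): avg(red[p, t, o] for p in range(P)) for t in range(T) for o in range(O)}

# Second-order residual.
pto = {(p, t, o): red[p, t, o] - pt[p, t] - po[p, o] - to[t, o] for p, t, o in cells}

# Each squared mean is weighted by the number of scores contributing to it.
ss = {
    "P": T * O * sum(m ** 2 for m in p_mean),
    "T": P * O * sum(m ** 2 for m in t_mean),
    "O": P * T * sum(m ** 2 for m in o_mean),
    "PT": O * sum(v ** 2 for v in pt.values()),
    "PO": T * sum(v ** 2 for v in po.values()),
    "TO": P * sum(v ** 2 for v in to.values()),
    "PTO": sum(v ** 2 for v in pto.values()),
}
total = sum(scores[p][t][o] ** 2 for p, t, o in cells)

print(ss, total)
```

The seven sums of squares should add up to the total sum of squares of the original scores.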


We check that the total sum of squares of the original scores is 128. 
While this analysis serves to illustrate the method of partitioning the 
between groups variance in the general case of any number of factors, 
with each factor at any number of levels, it is not a typical factorial 
design. It differs from the usual design in that there is only one 
observation in each of the twelve cells, and so we have no estimate of 
within group variance against which to test the significance of a factor 
or interaction term. In these circumstances it is usual to construct a 
denominator for the F ratio. No test is made of the significance of PTO, 
but the MS for PTO is used as a denominator in testing the significance 
of the first-order interactions. Thus the variance ratio for PT is 

$$F_{2,2} = 4/4 = 1.00$$


If none of the first-order interactions proves significant, we test the main 
terms using a denominator derived from the sums of squares and the 
degrees of freedom of all the interaction terms (but see Snedecor, 1956, 
chapter 12.10 for a better method). 

$$\text{Denominator of } F = \frac{8 + 8 + 12 + 8}{2 + 2 + 1 + 2} = 5.14$$

If we suspect that the tests have heterogeneous variances, the obvious 
thing to do is to express the test scores in units of standard deviation 
before carrying out the analysis. If we suspect that differences between 
the test means are simply artefacts (as will usually be the case in 
non-cognitive tests: we have no way of comparing a mean score of 24 on 
a 48-item questionnaire with a mean score of 5 on a 15-item 
questionnaire), then we may express the test scores as deviations from 
their respective test means before standardizing. 


In spite of these warnings against this particular application of the 
method, the arithmetical example which has been given is quite general 
and applies to any k-factor design. It includes the 2 × 2, 2 × 2 × 2 
and, in general, $2^n$ designs as special cases. The only modification is 
that the above table of original scores will, in the more usual type of 
analysis, represent a table of means, the number of persons contributing 
to each mean being constant over all the cells. The table of the com- 
pleted analysis of variance will then include a within group or error 
term. 

Our reason for presenting the k-factor analysis of variance in these 
terms is that, in spite of its limitations, the PTO analysis of psycho- 
logical processes often proves extremely illuminating. Although the 
method has been fairly often described in articles on the design of 
psychological investigations (see, for example, Burt, 1955, and 
Mahmoud, 1955), it has not been employed as often as it might have been. 
It is frequently superior to the more usual approach via correlation 
coefficients. 


Chapter 4 Principal Components 


In technical language the subject of this chapter is the calculation of 
principal axis transformations of quadratic surfaces in a multidimen- 
sional space. It is to be hoped that the reader does not take fright at 
this description: considered intuitively, nothing could be easier than to 
determine the long and short axes of an ellipse; and that is the paradigm 
of the method of principal axes or principal components. The method 
of principal axes is a branch of pure mathematics which is put to prac- 
tical use in fields such as astronomy and engineering. In essence it takes 
a complicated system of relations and reduces it to a canonical or normal 
form; this form is an orthogonal set of axes. In psychology and the 
social sciences the elliptical surface whose axes we determine is a no- 
tional figure. It may be thought of as a contour of equal density. Just 
as we draw contours on a map by joining points of equal height above 
sea level, so we may, in our sample, draw contours by joining together 
points with identical frequencies of persons. If only two tests have been 
used, we may think of each person as represented by a point on an 
ordinary graph whose axes are the tests. If the number of persons is 
large enough, it is fairly easy to trace the contours of the sample by eye. 

For some purposes it is appropriate to assume that the contours of 
equal density are elliptical, or in the case of three or more dimensions 
ellipsoidal. If this assumption can be made, the principal components 
which we calculate may be regarded as the principal axes of the notional 
ellipsoid. A principal axis is a line which has the properties of (a) joining 
the centre of the ellipsoid to its surface, and (b) being perpendicular to 
the tangent at the point where the axis meets the surface. The equations 
of principal component analysis specify the lengths and directions of 
the lines which have these properties. 

A rugby ball is a typical three-dimensional ellipsoid. It has three axes, 
one for each dimension of physical space. The first is the long axis 
joining its pointed ends. The position of the other two axes is indeter- 
minate because they are equal in length. They must be at right-angles to 
the first axis; and they must, therefore, lie in the plane which is revealed 
if we cut the ball in two along its waist-line. But within this plane the 
skin of the ball is circular, not elliptical. So long as they are at right- 
angles to one another, the position of the second and third axes, which 
are diameters of the circle, may be decided arbitrarily. In a principal 
component analysis a latent root specifies the length of an axis and a 
latent vector specifies its direction. 

The assumption of an ellipsoidal surface of equal frequency is often 
convenient, because it is implicit in the assumption that the distribution 
of the population from which the sample is drawn has a multivariate 
normal form. The multivariate normal distribution plays the same part 
in multivariate statistics as the univariate normal distribution in analysis 
of variance and the bivariate normal distribution in correlation analysis. 
Nevertheless, it is not essential to assume that the sample is drawn from 
a population which has an ellipsoidal contour of equal density. The 
components may be regarded as lines which, successively, provide the 
closest fit to the points which represent persons. The first component 
passes as close as possible to the members of the sample in the sense 
that the sum of squared deviations of person-points from the axis which 
represents the component is a minimum. Each succeeding component 
satisfies the same criterion, subject to the restriction that it is at right- 
angles to all preceding components. 

Why do we seek orthogonal reference axes? Obviously because we 
wish to refer persons to them, just as we impose a rectangular grid on 
a map of the British Isles so that we can specify any point in Britain by 
two numbers. In psychology, we wish to state the position of a person 
with the greatest possible simplicity, and this is achieved by giving his 
position on each member of an orthogonal set of dimensions. Such a 
statement is simple because there is one and only one number for each 
dimension, each number conveys a unique piece of information which 
is not conveyed by any of the other numbers, and the whole set of 
numbers contains all the information about the person which has been 
fed into the analysis. 

There is nothing in this formulation of the problem which suggests 
that some dimensions are more real, meaningful or important than 
others, and there is nothing which implies that some dimensions are 
dispensable or negligible. The existence of a component or dimension 
is a brute fact. If the dimension exists in the data, then it will appear in 
the analysis; if it does not exist, it will not appear. The research worker 
may decide to ignore certain dimensions, perhaps because they are 
artefacts, but the analysis cannot make the decision for him. In a long 
thin country like Chile it may be that, for many purposes, the latitude 
of a town is more important to the traveller than its longitude, in the 
sense that it conveys more information, because knowledge of latitude 
rules out more possibilities than knowledge of longitude. But, to the 
boatman travelling westward to the Pacific by river, a knowledge of his 
longitude conveys more information about his progress than a know- 
ledge of his latitude. 

The method of principal axes is not a statistical method leading to a 
decision on a hypothesis. It is a general method of displaying inter- 
relations in the data. As such it forms an integral part of the statistical 
methods which are described in later chapters. The method is analogous 
to the logician’s technique of reducing a complex proposition to canoni- 
cal or normal form in order to exhibit its structure. Because principal 
component analysis is a formal rather than a material technique, we 
shall give a numerical example without any pretence that we are 
analysing actual data. 


An example 


The extraction of latent roots and their associated latent vectors or 
principal components is laborious to carry through by hand even if the 
matrix is as small as, say, 4 x 4,* but the principles behind the method 
are quite straightforward and may be illustrated in a simple arithmetical 
example. 
Suppose that we have given two tests to each of eight persons and 
have obtained the following n × t matrix of scores M. The scores are 
expressed as deviations from the test means: 

         t₁      t₂     Product 
p₁       -8      -1        8 
p₂        6      10       60 
p₃       -2     -10       20 
p₄        8       1        8 
p₅        0       3        0 
p₆       -6      -6       36 
p₇        0      -3        0 
p₈        2       6       12 

      SS 208  SS 292   SP 144 


The sums of squares and sums of products matrix W = M'M is 

$$W = \begin{pmatrix} 208 & 144 \\ 144 & 292 \end{pmatrix}$$

and the cosine between the two 8 × 1 test vectors (which is equal to 
the correlation between the two tests) is 
$144 / \sqrt{208 \times 292} = 0.58$. The total 
of the sums of squares of the two tests is 208 + 292 = 500. Within the 
limits set by this total, we wish to rotate each 8 × 1 test vector until 


* Two simple examples are worked through in detail in Hope, Elementary 
Statistics, 1967. 
(a) the two resulting vectors (called components) are uncorrelated or 
orthogonal, and (b) the first component vector accounts for as much as 
possible of the total variance (and, by implication, the second accounts 
for as little as possible). To rotate the test vectors until they satisfy 
these two criteria, we multiply each person's pair of scores first by one 
set of weights and then by another. To put it in matrix terms: we 
multiply the score matrix M by the weighting matrix F where F is 

$$F = \begin{pmatrix} f_1 & f_2 \end{pmatrix} = \begin{pmatrix} 0.6 & 0.8 \\ 0.8 & -0.6 \end{pmatrix}$$

We shall first of all explore the properties of F and only later explain 
how it is derived. The position of person one on the first component is 
calculated as 

$$(-8 \times 0.6) + (-1 \times 0.8) = -5.6$$

The position of the same person on the second component is calculated 
as 

$$(-8 \times 0.8) + (-1 \times -0.6) = -5.8$$

Performing these calculations for every person, we obtain the matrix C, 
which contains the two components of the score matrix: 

          c₁       c₂     Product 
p₁       -5.6     -5.8     32.48 
p₂       11.6     -1.2    -13.92 
p₃       -9.2      4.4    -40.48 
p₄        5.6      5.8     32.48 
p₅        2.4     -1.8     -4.32 
p₆       -8.4     -1.2     10.08 
p₇       -2.4      1.8     -4.32 
p₈        6.0     -2.0    -12.00 

       SS 400   SS 100     SP 0 

The calculation of component scores is represented by the matrix 
equation C = MF. 

If we calculate the matrix of sums of squares and sums of products 
of matrix C (that is, C'C) we obtain the diagonal matrix of latent 
roots Λ (capital lambda): 

$$\Lambda = C'C = \begin{pmatrix} 400 & 0 \\ 0 & 100 \end{pmatrix}$$

Thus the total sum of squares (500) is more unequally divided between 
the components than it was between the tests, and the correlation 
between the two components, unlike the correlation between the tests, is 
zero. The sum of squares of the first component (400) is known as the 
first latent root ($\lambda_1$), and the sum of squares of the second component 
(100) is known as the second latent root ($\lambda_2$). (λ is small lambda.) $f_1$ is 
called the first latent vector of W and $f_2$ is its second latent vector. 
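
The calculation of W, C and C'C can be sketched in plain Python. The scores and weights are those of the example; the helper functions `matmul` and `transpose` are illustrative and stand in for whatever matrix routines the reader has to hand.

```python
# Deviation scores of eight persons on two tests (rows of M).
M = [(-8, -1), (6, 10), (-2, -10), (8, 1), (0, 3), (-6, -6), (0, -3), (2, 6)]

# Weighting matrix F; its columns are the latent vectors f1 and f2.
F = [[0.6, 0.8],
     [0.8, -0.6]]

def matmul(a, b):
    """Product of two matrices given as sequences of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

W = matmul(transpose(M), M)      # sums of squares and products of the tests
C = matmul(M, F)                 # component scores, C = MF
Lam = matmul(transpose(C), C)    # C'C: diagonal, apart from rounding error

print(W)
print([[round(v, 6) for v in row] for row in Lam])
```

The off-diagonal elements of C'C vanish, which is the arithmetical expression of the statement that the components are uncorrelated.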

If we imagine the two tests as the vertical and horizontal axes of an 
ordinary graph, we can plot the position of each person on the graph 
as a point. The person-points are assumed to lie in an ellipse (the ellipse 
represents a contour of equal density of persons in the test space; we 
should have to graph data from many hundreds of persons before this 
became obvious). In rotating the tests to find the principal axes of the 


ellipse, the variance of the first component becomes greater than the 
variance of either of the two tests. The first component is the [...] 
to its circumference; the variance of the second component 
uses up all the variance left over from the first. 

   f₁' = (0.6, 0.8)              f₂' = (0.8, -0.6) 

    c₁        c₁f₁'               c₂        c₂f₂' 
  -5.6    -3.36   -4.48         -5.8    -4.64    3.48 
  11.6     6.96    9.28         -1.2    -0.96    0.72 
  -9.2    -5.52   -7.36          4.4     3.52   -2.64 
   5.6     3.36    4.48          5.8     4.64   -3.48 
   2.4     1.44    1.92         -1.8    -1.44    1.08 
  -8.4    -5.04   -6.72         -1.2    -0.96    0.72 
  -2.4    -1.44   -1.92          1.8     1.44   -1.08 
   6.0     3.60    4.80         -2.0    -1.60    1.20 

The reader may readily confirm that these two matrices sum to M. The 
matrix equation for this calculation is M = CF'. If we square every 
element in the first matrix and sum these squares, we find that they add 
up to the first latent root ($\lambda_1$ = 400). The sum of squares of the second 
matrix is equal to the second latent root ($\lambda_2$ = 100). 
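
The confirmation suggested here may be sketched in Python as follows. Each component is multiplied by the transpose of its latent vector, and the two resulting matrices are added; the names `part1`, `part2` and `rebuilt` are illustrative.

```python
M = [(-8, -1), (6, 10), (-2, -10), (8, 1), (0, 3), (-6, -6), (0, -3), (2, 6)]
f1, f2 = (0.6, 0.8), (0.8, -0.6)

# Component scores for every person.
c1 = [x * f1[0] + y * f1[1] for x, y in M]
c2 = [x * f2[0] + y * f2[1] for x, y in M]

# The contribution of each component to the score matrix.
part1 = [(c * f1[0], c * f1[1]) for c in c1]   # c1 f1'
part2 = [(c * f2[0], c * f2[1]) for c in c2]   # c2 f2'

# Their sum restores the original scores: M = CF'.
rebuilt = [(a[0] + b[0], a[1] + b[1]) for a, b in zip(part1, part2)]

# The sum of squares of each part equals the corresponding latent root.
ss1 = sum(u * u + v * v for u, v in part1)
ss2 = sum(u * u + v * v for u, v in part2)

print([tuple(round(v, 6) for v in row) for row in rebuilt])
print(round(ss1, 6), round(ss2, 6))
```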

The above example differs from those most frequently reported only 
in that it is an analysis of unstandardized deviations, whereas most 
reported analyses normalize the deviations (i.e. divide them by the 
square root of their sum of squares) before extracting latent roots and 
vectors. The reason for normalizing scores is that we usually have no 
grounds for supposing that the difference between the variances of any 
two tests represents a real difference in the importance of the functions 
which the tests measure. If one test has a higher sum of squares than 
another, it will be given greater weight in the analysis. To obviate this 
differential weighting, all the sums of squares are equalized by 
normalizing every test. The sums of squares and sums of products matrix 
of the normalized scores is the correlation matrix R. Only in exceptional 
circumstances is the analysis of the correlation matrix R identical, apart 
from a constant multiplier, with the analysis of W. The identity occurs 
only if every test has the same variance. 
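
The effect of normalizing can be seen in a short Python sketch using the scores of the example. It relies on the standard result that the latent roots of a 2 × 2 correlation matrix are 1 + r and 1 − r; the names `share_R` and `share_W` are illustrative.

```python
import math

M = [(-8, -1), (6, 10), (-2, -10), (8, 1), (0, 3), (-6, -6), (0, -3), (2, 6)]

ss_t1 = sum(x * x for x, _ in M)
ss_t2 = sum(y * y for _, y in M)

# Normalize each test: divide by the square root of its sum of squares.
Z = [(x / math.sqrt(ss_t1), y / math.sqrt(ss_t2)) for x, y in M]

# R = Z'Z has unit diagonal and the correlation r off the diagonal.
r = sum(u * v for u, v in Z)
R = [[sum(u * u for u, _ in Z), r],
     [r, sum(v * v for _, v in Z)]]

# Share of the total taken by the first component in each analysis.
share_R = (1 + r) / 2      # analysis of R: first root (1 + r) out of a total of 2
share_W = 400 / 500        # analysis of W: first root 400 out of a total of 500

print(round(r, 2), round(share_R, 3), share_W)
```

Since the two tests have different sums of squares (208 and 292), the two analyses do not apportion the total in quite the same way.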
Apart from this difference in substance, all the other differences 
between this analysis and the analyses most commonly reported are 
specious rather than real. The reader is usually presented with the 
correlation matrix R (or the matrix W divided by the number of persons) 
and the so-called factor or component matrix F. He is told that the 
columns of F represent the components or factors of the matrix R (or 
W). This is quite true, but the analysis is of interest because it is an 
analysis of the score matrix M. W is calculated as M'M only because 
the latent vectors of M'M, when multiplied by M, yield the components 
of M. 


Equations of the method 


The 'characteristic equation' which represents the extraction of latent 
roots takes the form 

$$|W - \lambda I| = 0$$

This equation states that a certain determinant must be zero for every 
value of λ. In our example, this is the determinant of 

$$\begin{pmatrix} 208 & 144 \\ 144 & 292 \end{pmatrix} - \lambda_i \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

which becomes, after the subtraction is performed, 

$$\begin{pmatrix} 208 - \lambda_i & 144 - 0 \\ 144 - 0 & 292 - \lambda_i \end{pmatrix}$$

The determinant of this matrix must be zero: 

$$(208 - \lambda_i)(292 - \lambda_i) - (144 - 0)(144 - 0) = 0$$

Multiplying out, we arrive at the equation 

$$\lambda^2 - 500\lambda + 40{,}000 = 0$$

The values of $\lambda_i$ which satisfy this equation are 

$$\lambda_1 = 400$$
$$\lambda_2 = 100$$
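
For a 2 × 2 matrix the characteristic equation is a quadratic whose coefficients are the sum of the diagonal elements and the determinant, so the roots can be found with the ordinary quadratic formula. A minimal Python sketch, with illustrative variable names:

```python
import math

W = [[208, 144], [144, 292]]

# lambda^2 - (trace) * lambda + (determinant) = 0
trace = W[0][0] + W[1][1]                          # 500
det = W[0][0] * W[1][1] - W[0][1] * W[1][0]        # 40,000

disc = math.sqrt(trace ** 2 - 4 * det)
lam1 = (trace + disc) / 2
lam2 = (trace - disc) / 2

print(trace, det, lam1, lam2)
```

The roots sum to the total of the sums of squares of the two tests, as they must.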


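The latent vector $f_i$ which corresponds to a particular value of $\lambda_i$ 
satisfies the equation 

$$(W - \lambda_i I) f_i = 0$$

In our example $\lambda_1 I$ is 

$$400 \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 400 & 0 \\ 0 & 400 \end{pmatrix}$$

Setting out the equation in full for $\lambda_1$ and $f_1$ we have 

$$\left[ \begin{pmatrix} 208 & 144 \\ 144 & 292 \end{pmatrix} - \begin{pmatrix} 400 & 0 \\ 0 & 400 \end{pmatrix} \right] \begin{pmatrix} 0.6 \\ 0.8 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

which becomes 

$$\begin{pmatrix} -192 & 144 \\ 144 & -108 \end{pmatrix} \begin{pmatrix} 0.6 \\ 0.8 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$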

The reader may care to set out the equation for the second latent root 
and its associated weighting vector, and check that the equation reduces 
to a column of zeros. 
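
The check for both latent roots can be sketched in a few lines of Python; the function name `residual` is illustrative.

```python
W = [[208, 144], [144, 292]]

# Each latent root paired with its latent vector.
pairs = [(400, (0.6, 0.8)), (100, (0.8, -0.6))]

def residual(lam, f):
    """Return (W - lam * I) f for the 2 x 2 matrix W."""
    return ((W[0][0] - lam) * f[0] + W[0][1] * f[1],
            W[1][0] * f[0] + (W[1][1] - lam) * f[1])

# Both residuals should be columns of zeros, apart from rounding error.
residuals = [residual(lam, f) for lam, f in pairs]

# The two latent vectors are at right-angles: their product is zero.
f1, f2 = pairs[0][1], pairs[1][1]
cross_product = f1[0] * f2[0] + f1[1] * f2[1]

print(residuals, cross_product)
```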

The columns of F are components of M only in the sense in which the 
weighting vectors used in making orthogonal comparisons in a factorial 
design in the analysis of variance (chapter 3) are themselves orthogonal 
comparisons. A weighting vector in a factorial design is a means for 
producing a particular orthogonal comparison. In an analogous way a 
column of matrix F is a means for producing a particular component. 

Of course, it does no harm to report only W (or R) and F in a research 
report, provided that the reader is aware of the background to these two 
matrices. It is seriously misleading to speak as if components are simply 
components of tests; they are components of the scores of persons on 
tests. And latent roots are the sums of the squares of such components. 

We have seen an actual example of a principal component analysis, 
and we have confirmed that the reported latent roots and latent vectors 
satisfy the equations 

$$|W - \lambda I| = 0$$
$$(W - \lambda_i I) f_i = 0$$


But where did the equations come from? In geometry a vector is a 
quantity which has direction or orientation as well as magnitude or 
length. Multiplication of a vector by a matrix alters the direction of the 
vector. To put the same statement in algebraic terms, premultiplication 
of the vector x by the matrix A transforms vector x into another vector, 
let us say, y: 

$$Ax = y$$

y is a transformation of x. In the case of a transformation to principal 
components, y has the same direction as x, and differs from it only by 
the value of a constant multiplier λ: 

$$Ax = \lambda x$$

or, employing the symbols adopted in the foregoing example, 

$$Wf = \lambda f$$

which may be written 

$$Wf - \lambda f = 0$$

or* 

$$(W - \lambda I)f = 0$$


* Note the appearance of the unit matrix I. I is implicit in a matrix term, but 
it is usually not made explicit because multiplication of any matrix by I leaves 
the matrix unchanged. I must, however, be made explicit if a term would 
otherwise cease to have matrix form; cf. (ab − b) = b(a − 1). 

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This equation has a solution other than f = 0 only if the determinant 
|W — Al| is zero. 
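
For a 2 x 2 matrix the determinantal equation is a quadratic in the latent root, and the sketch below solves it directly. It assumes the matrix W of the worked example; the function names are hypothetical, and the vector routine assumes that the off-diagonal element of W is not zero.

```python
import math

W = [[208.0, 144.0],
     [144.0, 292.0]]

def latent_roots_2x2(W):
    """Roots of |W - lambda*I| = 0 for a symmetric 2 x 2 matrix, largest first."""
    trace = W[0][0] + W[1][1]
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    disc = math.sqrt(trace * trace - 4.0 * det)
    return (trace + disc) / 2.0, (trace - disc) / 2.0

def latent_vector_2x2(W, lam):
    """Unit-length f satisfying (W - lam*I) f = 0 (assumes W[0][1] != 0)."""
    # First row of the equation: (W00 - lam) * x + W01 * y = 0.
    x, y = W[0][1], lam - W[0][0]
    length = math.hypot(x, y)
    return x / length, y / length

roots = latent_roots_2x2(W)
vectors = [latent_vector_2x2(W, lam) for lam in roots]
print(roots, vectors)
```

Only the ratio of the two elements of each vector is fixed by the equation; the division by `length` is one conventional way of choosing among the infinitely many proportional solutions.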

In order to interpret the equation Wf = Af geometrically, let us 
choose a point on the surface of an ellipse and imagine that a tangent 
to the ellipse has been drawn at that point. We may then construct the 
normal, which is a line perpendicular to the tangent. There are a few 
special points which are such that the normal has the same direction as 
the radius joining the centre of the ellipse to the point on the surface. 
The identity of direction between the normal and the radius defines a 
latent vector. 

Geometry, it has been said, is not true (or false), but convenient. 
Considered as a geometrical method, principal component analysis is 
not true or false, meaningful or meaningless, but simply a possible, and 
often convenient, transformation. Components, because they are un- 
correlated, are often easier to interpret than tests. In general, principal 
axis transformations are most useful when they are employed in con- 
junction with other methods such as multiple regression analysis and 
analysis of variance. A set of components is sometimes a helpful guide 
in the preliminary exploration of a complex field. Occasionally a well- 
planned principal component analysis is worth reporting in itself. 

It will be observed that the arithmetical values of the elements of a
latent vector $f_i$ are not given by the above equations. The equation for
$f_1$ would equally well be satisfied by the values 3 and 4, or by the values
12 and 16. But the ratio between the two elements is a constant, and it
is this relative weighting of the variables which is determined by the
equation.


Interpretation 
The matrix F might be represented in any of the following forms. To
facilitate comparison, each form is bordered by the sums of squares of
its rows and columns.

             F_A                      F_B                        F_C
       f_1   f_2    SS        f_1       f_2     SS        f_1       f_2     SS
t_1     12     8   208     12/√208    8/√208     1     12/√400    8/√100     1
t_2     16    -6   292     16/√292   -6/√292     1     16/√400   -6/√100     1
SS     400   100             1.57      0.43                1         1

If each element of $F_C$ is expressed as a decimal, it will be seen that $F_C$
is the matrix F which was employed above in the calculation of scores
on components. $F_C$ has the property that the sum of squares of every
column is unity. In $F_A$ the elements of each latent vector have been
expressed in such a way that the sum of squares of a vector is equal to
its associated latent root. Provided that all the latent vectors have been
included in the matrix, the sum of squares of a row is equal to the sum
of squares of the test in that row.

In matrix $F_B$ each element of a row has been divided by the square
root of the test's sum of squares. (The sum of squares for each test in
an analysis of the correlation matrix R is unity.) By squaring the
appropriate element of $F_B$ we obtain the proportion of the total variance of
a test which is accounted for by a particular component. Thus the
proportion of the variance of the first test which is accounted for by, or
shared with, the first component is 144/208, or 69 per cent.

The matrix $F_C$, which contains the normalized latent vectors of the
sums of squares and sums of products matrix W, has the sum of squares
of each of its columns set to unity. By squaring the appropriate element
of $F_C$ we obtain the proportion of a component's variance which is
accounted for by, or shared with, a particular test. The proportion of the
first component of M which is attributable to the first test is 144/400, or
36 per cent. The calculation of simple proportions or percentages
provides a convenient and easily comprehended means of summarizing the
conclusions of a multivariate analysis. If a test has a weight of 0.9 in a
normalized component (that is, in a column of $F_C$), we know that
81 per cent of the variance of the component is attributable to the test.
If the weight is 0.3, then only 9 per cent of the component is accounted
for by the test.

$F_C$ is usually the easiest of the three matrices to interpret. The
contribution of a test to a component can be immediately grasped by
squaring the appropriate element of the matrix. The contribution of a
component to a test is not so obvious. By concentrating exclusively on
$F_C$ it is possible to overlook such an interesting fact as that a particular
test is largely accounted for by one of the later components. A
comprehensive analysis therefore involves both the calculation of $F_C$ and
also the calculation of the percentage of each test's variance which is
accounted for by each component.
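
The relations among the three forms of F, and the two kinds of percentage, can be sketched as follows. The sketch starts from the normalized vectors and latent roots of the worked example; the variable names are illustrative assumptions, not the book's notation.

```python
import math

roots = [400.0, 100.0]
F_C = [[0.6, 0.8],
       [0.8, -0.6]]          # rows are tests, columns are components

# F_A: each column of F_C scaled so that its sum of squares equals its latent root.
F_A = [[F_C[i][j] * math.sqrt(roots[j]) for j in range(2)] for i in range(2)]

# The sum of squares of each test is the row sum of squares of F_A.
test_ss = [sum(v * v for v in row) for row in F_A]

# F_B: each row of F_A divided by the square root of the test's sum of squares.
F_B = [[F_A[i][j] / math.sqrt(test_ss[i]) for j in range(2)] for i in range(2)]

# Percentage of each test's variance accounted for by each component.
pct_of_test = [[100.0 * v * v for v in row] for row in F_B]
# Percentage of each component's variance attributable to each test.
pct_of_component = [[100.0 * v * v for v in row] for row in F_C]

print(F_A, test_ss)
print(pct_of_test)
print(pct_of_component)
```

Each row of `pct_of_test` and each column of `pct_of_component` sums to 100, which is a convenient check that no component has been left out.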


Analysis of normalized scores 


Let me repeat that $F_C$ contains the normalized weighting vectors which,
when applied to the rows of M, yield the components of the
non-normalized scores. If each element of M is divided by the square root
of the sum of squares of its column, we obtain the normalized score
matrix $M_n$. We calculate the correlation matrix R by the formula

$$M_n' M_n = R$$

For the sake of completeness, we may report the normalized weighting
vectors which, when applied to the rows of $M_n$, yield the components
of the normalized scores. The matrix $F_D$, which contains the latent
vectors of the correlation matrix R, bears no simple relation to any of
the F matrices derived from W.


           F_D
        f_1       f_2      SS
t_1   0.7071    0.7071      1
t_2   0.7071   -0.7071      1
SS       1         1


Although the columns of $F_D$ contain the weighting vectors
appropriate to the normalized score matrix $M_n$, in practice they are applied
to the standardized scores. The reason is that, as the number of persons
in the sample increases, the order of size of normalized scores decreases.
Standardized scores, on the other hand, have a constant order of size
whatever the size of the sample. A standardized score is obtained by
dividing each element of M by the standard deviation of its column,
that is, by the square root of the column sum of squares divided by the
number of persons in the sample. Thus, instead of dividing each element
in the first column by $\sqrt{208}$, we divide by $\sqrt{208/n}$. Any standardized
score may be obtained from the equivalent normalized score by simply
multiplying the latter by $\sqrt{n}$, where n is the sample size. The matrix of
standardized scores is therefore equal to the matrix of normalized scores
multiplied by the constant $\sqrt{n}$. The two matrices do not differ in their
essential properties, and either may be used in place of the other. In
theoretical discussions it is usual to employ the normalized matrix $M_n$,
but in actual calculations of scores on components, the standardized 
matrix is almost invariably employed. 
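
The relation between the two scalings may be illustrated with a short sketch. The column of deviation scores below is hypothetical (it is not taken from the worked example), and the variable names are assumptions made for the illustration.

```python
import math

column = [3.0, -1.0, -2.0, 0.0]   # hypothetical deviation scores on one test
n = len(column)
ss = sum(x * x for x in column)   # column sum of squares

# Normalized scores: divide by the square root of the sum of squares.
normalized = [x / math.sqrt(ss) for x in column]
# Standardized scores: divide by the standard deviation, sqrt(ss / n).
standardized = [x / math.sqrt(ss / n) for x in column]

print(sum(z * z for z in normalized))     # sum of squares of normalized scores
print(sum(z * z for z in standardized))   # sum of squares of standardized scores
```

The normalized column has a sum of squares of unity whatever the value of n, whereas the standardized column has a sum of squares of n, so that its individual scores keep the same order of size as the sample grows.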


Correlations between vectors 


In the geometrical picture which was described at the beginning of the 
chapter, the tests were identified with the rectangular axes of co-ordinates 
in an ordinary graph. An alternative picture may be drawn, in which the
components are the rectangular axes of co-ordinates and the tests are
lines at an acute or obtuse angle to one another. All components and
tests may be represented as vectors crossing at a common point, which
is conveniently taken as the origin of the co-ordinate system. The
correlation between two vectors is the cosine of the angle made by those
vectors at the origin. The correlation between a test $t_i$ and a component
$c_j$ is calculated exactly as one would calculate the correlation between
any two columns of scores. Both the test and the component are
normalized by dividing each score by the square root of the column sum
of squares. The correlation is the inner product of the two columns of
normalized scores. The inner product of two vectors is simply the sum
of products of corresponding elements. We therefore multiply the two
normalized scores of a person and sum over all persons. If $M_n$ is the
matrix of raw scores normalized by columns and $C_n$ is the matrix of
component scores normalized by columns, then the matrix of all
correlations between tests and components is $M_n' C_n$. It can be shown that
this is the matrix $F_B$. Thus each element of $F_B$ is a correlation or cosine
between a test and a component.
As an example, the angle between test one and the first component
has the cosine $12/\sqrt{208} = 0.8321$. If we look up this value in the table of
cosines which is given in most sets of four- or five-figure log tables, we
find that it corresponds to an angle of 34°. Since there are only two
components, and these are at right-angles to one another, we can see
immediately that the angle between test one and the second component
must be 90° - 34° = 56°. If the reader has a set of tables containing
square roots and cosines, he will be able to check this value against the
appropriate element of $F_B$.

The cosine between a test and a component is known as a direction
cosine of the test. The tests, as they are reported in $F_B$, have two useful
properties. They are all of unit length, and they are referred to a
rectangular set of axes. Each component is an axis. It will be recalled that
the formula for the cosine of an angle of a right-angled triangle is the
ratio of the adjacent side to the hypotenuse. Each test vector is a
hypotenuse of unit length. We may construct the perpendicular from
a component vector to the outer end of a test vector, which gives us a
right-angled triangle with a unit hypotenuse. The cosine of the angle
between the test and the component at the origin is equal to the length
of the adjacent side, that is, to the distance from the origin to the foot
of the perpendicular. Thus, as well as being a direction cosine, each
element of $F_B$ is also a projection of a test on to a component.

There is a simple general theorem of solid geometry which states that,
if two vectors of unit length have been referred to rectangular axes, the
cosine of the angle between the vectors is equal to the inner product of
the direction cosines of the vectors. Thus the cosine between test one
and test two is given by the inner product

$$\left(\frac{12}{\sqrt{208}} \times \frac{16}{\sqrt{292}}\right) + \left(\frac{8}{\sqrt{208}} \times \frac{-6}{\sqrt{292}}\right) = \frac{144}{\sqrt{208}\,\sqrt{292}}$$


which is the same as the value calculated from the matrix of sums of
squares and sums of products. This angle is an angle in the complete
space of the analysis. Often, however, certain dimensions are omitted
from the analysis, in particular those associated with the smaller latent
roots. We must therefore imagine the original correlation between a pair
of tests represented in a space of fewer dimensions than the original
space. The projection of the test vectors on to a subspace will probably
distort the angle between the tests to some extent. The cosine of the
angle between two tests in a space from which certain dimensions
(components) have been deleted is obtained by calculating a new version of
$F_B$. The sums of squares employed in the divisors are calculated from
the retained components only. To take a trivial example, deleting $f_2$
from our example leaves us with only one dimension. The sum of squares
of $t_1$ on the retained dimension is 144, and the sum of squares of $t_2$ is
256. The correlation between the two tests in the one-dimensional space
is therefore

$$\frac{12}{\sqrt{144}} \times \frac{16}{\sqrt{256}} = 1$$

The only possible correlations within a one-dimensional space are +1
and -1.

Matrix $F_A$ may be used to generate the matrices which Sir Cyril Burt
has named unit hierarchies (Burt, 1938). The unit hierarchies of our
example are calculated by multiplying each column of $F_A$ by its
transpose:

          12    16              8    -6            Sum
    12   144   192        8    64   -48         208   144
    16   192   256       -6   -48    36         144   292


The unit hierarchies for all the components add up to the sums of 
squares and sums of products matrix W. Each unit hierarchy is a matrix 
containing only one component. The correlation between two tests on 
any set or subset of components may be obtained by adding together 
the sums of squares and sum of products of the tests in the appropriate
unit hierarchies. For example, the correlation between the two tests in
the space of both of the components is

$$\frac{192 - 48}{\sqrt{144 + 64}\,\sqrt{256 + 36}}$$
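
The construction of unit hierarchies, and their use in finding the correlation between the two tests on any subset of components, may be sketched as follows. The matrix is the $F_A$ of the worked example; the function names `unit_hierarchy` and `correlation` are hypothetical.

```python
import math

F_A = [[12.0, 8.0],
       [16.0, -6.0]]          # rows are tests, columns are components

def unit_hierarchy(F, j):
    """Product of column j of F with its own transpose."""
    col = [row[j] for row in F]
    return [[a * b for b in col] for a in col]

def correlation(F, components):
    """Correlation between test one and test two in the space of the given components."""
    hierarchies = [unit_hierarchy(F, j) for j in components]
    ss_1 = sum(h[0][0] for h in hierarchies)
    ss_2 = sum(h[1][1] for h in hierarchies)
    sp = sum(h[0][1] for h in hierarchies)
    return sp / math.sqrt(ss_1 * ss_2)

H1, H2 = unit_hierarchy(F_A, 0), unit_hierarchy(F_A, 1)
W = [[H1[i][j] + H2[i][j] for j in range(2)] for i in range(2)]

r_full = correlation(F_A, [0, 1])     # both components retained
r_first_only = correlation(F_A, [0])  # second component deleted
print(W, r_full, r_first_only)
```

With both components retained the function returns the ordinary correlation between the tests; with the second component deleted it returns unity, as in the one-dimensional example above.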


Orthogonality 

Matrix $F_C$, which was the matrix actually used in our calculation of
scores on components, is an orthogonal matrix. In an orthogonal matrix
the inner product of any row and any other row, or the inner product
of any column and any other column, is zero, while the product of any
row or column with itself is unity. This property is neatly summed up
in the matrix equation

$$F F' = F' F = I$$

which states that F (in the sense of $F_C$) postmultiplied or premultiplied
by its transpose is equal to the unit matrix. It will be recalled that the
definition of an inverse or reciprocal matrix is that a matrix multiplied
by its inverse is equal to the unit matrix

$$F F^{-1} = F^{-1} F = I$$

from which it follows that the transpose of an orthogonal matrix is
equal to its inverse

$$F' = F^{-1}$$

(It so happens that matrix $F_C$ in our example is identical with its own
transpose. This is a pure accident, and the reader should leave it out of
account.)
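
The orthogonality of the matrix of normalized latent vectors can be confirmed numerically. The following is a minimal sketch using the values of the worked example; the helper functions `transpose` and `matmul` are written out here only to keep the example self-contained.

```python
F = [[0.6, 0.8],
     [0.8, -0.6]]

def transpose(A):
    """Interchange the rows and columns of A."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Ordinary matrix product of A and B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

FFt = matmul(F, transpose(F))   # F postmultiplied by its transpose
FtF = matmul(transpose(F), F)   # F premultiplied by its transpose
print(FFt, FtF)                 # both equal the unit matrix, apart from rounding
```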

There are important similarities between an orthogonal matrix of latent
vectors and an orthogonal matrix of weighting coefficients employed
in the calculation of designed comparisons in the analysis of variance.
The resemblance between F and the designed comparison matrix of
chapter 3 may be emphasized by adding to the latter a column of
all-positive elements.


        F                        x      y      z
   0.6    0.8           0.5    0.5    0.5    0.5
   0.8   -0.6           0.5    0.5   -0.5   -0.5
                        0.5   -0.5    0.5   -0.5
                        0.5   -0.5   -0.5    0.5
The difference between the two orthogonal weighting matrices (apart
from the fact that one is 2 x 2 and the other is 4 x 4) is that the 
weights in F are empirically derived, whereas the weights in the designed 
comparison matrix are imposed a priori. A designed comparison matrix 
typically contains only simple numbers, whereas a matrix of latent vec- 
tors contains numbers which must be specified to three or four decimal 
places. The difference in appearance follows from a difference in func- 
tion. The purpose of a designed comparison is to impose a dimension 
a priori in order to estimate its sum of squares, whereas a latent vector 
defines that empirical dimension whose sum of squares happens to be 
a maximum. In spite of the different origins and aims of the two pro- 
cedures, it is conceptually convenient to ensure that the axes specified 
by both matrices form rectangular sets. Later chapters illustrate the 
value of employing a judicious combination of a priori and empirical 
dimensions in the interpretation of multivariate data. It is sufficient here 
to note the advantages which are conferred by orthogonality, whatever 
the derivation of the weighting vectors. 


There is one more variant of F to be considered and that is 


           F_E
        f_1     f_2       SS
t_1    0.03    0.08    0.0073
t_2    0.04   -0.06    0.0052
SS    1/400   1/100


If the columns of $F_E$ are used to weight the scores in M, we obtain the
set of component scores $C_n$. The sum of squares of each component of
this set is unity. In effect, the original frequency ellipse has become a
circle. The same effect may be achieved by dividing each element of $c_i$
by the square root of the associated latent root.*

* $c_i$ may contract and expand in the same way as a latent vector $f_i$. Indeed,
just as $f_i$ is a latent vector of $(M'M - \lambda_i I) f_i = 0$, so $c_i$ is a latent vector of
$(MM' - \lambda_i I) c_i = 0$.

From a purely mathematical point of view $F_E$ is the most satisfactory
of the weighting matrices. The reader will not be able to follow the
reasoning behind this claim until he has become familiar with multiple
regression methods and with the properties of powers of matrices. We
shall, however, set out the arguments in this paragraph for the sake of
completeness, and the reader may refer back to it when he has the
knowledge necessary to understand it. The advantage of $F_E$ is that it
contains the latent vectors of $W^{-1}$, which is the reciprocal of W. We
may calculate $W^{-1}$ from $F_E$ by setting out the matrix products of the
latent vectors and summing them, as follows:

$$\frac{1}{160000}\begin{bmatrix} 144 & 192 \\ 192 & 256 \end{bmatrix} + \frac{1}{10000}\begin{bmatrix} 64 & -48 \\ -48 & 36 \end{bmatrix} = \frac{1}{10000}\begin{bmatrix} 73 & -36 \\ -36 & 52 \end{bmatrix} = W^{-1}$$
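
The summation can be verified with a short sketch. It assumes that the columns of $F_E$ are the normalized latent vectors divided by the square roots of their latent roots, which is consistent with the figures quoted above; the variable names are illustrative.

```python
import math

roots = [400.0, 100.0]
F_C = [[0.6, 0.8],
       [0.8, -0.6]]
W = [[208.0, 144.0],
     [144.0, 292.0]]

# Columns of F_E: normalized latent vectors divided by the root of the latent root.
F_E = [[F_C[i][j] / math.sqrt(roots[j]) for j in range(2)] for i in range(2)]

# Sum, over components, of each column of F_E multiplied by its own transpose.
W_inv = [[sum(F_E[i][c] * F_E[j][c] for c in range(2)) for j in range(2)]
         for i in range(2)]

# If W_inv really is the reciprocal of W, this product is the unit matrix.
product = [[sum(W[i][k] * W_inv[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
print(W_inv, product)
```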


If we think of the first component as a criterion variable to be predicted
from the test variables, we require the vector k which contains the
covariances between the tests and the component. The vector k for the
first component is the first column of $F_A$, and so we calculate the
regression equation $W^{-1} k = b$ as

$$\frac{1}{10000}\begin{bmatrix} 73 & -36 \\ -36 & 52 \end{bmatrix}\begin{bmatrix} 12 \\ 16 \end{bmatrix} = \begin{bmatrix} 0.03 \\ 0.04 \end{bmatrix}$$

The square of the multiple correlation is the inner product of k and b:
$R^2 = 1$. The regression equation b is equal to the first column of $F_E$.
The analogous calculation for the second component also yields an $R^2$
of unity and a regression equation equal to the second column of $F_E$.
The values of $R^2$ are unity because each component is defined entirely
by the tests and lies entirely within the test space.

The argument in favour of using $F_E$ in the calculation of component
scores (or, equivalently, of normalizing the component scores $c_i$) may
be grasped intuitively. If a large proportion of the tests employed by the
research worker consists of tests of a particular characteristic such as
intelligence or social adjustment, then that characteristic will be counted
several times over in the calculation of the scores of persons on the
latent vectors of $F_C$. By substituting $F_E$ for $F_C$ this redundancy is
eliminated.
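
The regression calculation for both components may be sketched as follows, taking the reciprocal matrix and the columns of $F_A$ from the worked example. The function name `regress` is a hypothetical label for the two steps described in the text.

```python
W_inv = [[0.0073, -0.0036],
         [-0.0036, 0.0052]]   # reciprocal of W, i.e. (1/10000) * [[73, -36], [-36, 52]]

def regress(W_inv, k):
    """Return the regression weights b = W_inv k and the squared multiple correlation k'b."""
    b = [sum(W_inv[i][j] * k[j] for j in range(len(k))) for i in range(len(k))]
    r_squared = sum(ki * bi for ki, bi in zip(k, b))
    return b, r_squared

b_1, r2_1 = regress(W_inv, [12.0, 16.0])   # first column of F_A
b_2, r2_2 = regress(W_inv, [8.0, -6.0])    # second column of F_A
print(b_1, r2_1)
print(b_2, r2_2)
```

In each case the weights reproduce a column of $F_E$ and the squared multiple correlation is unity, apart from rounding error.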

The difference is of no practical importance so long as the components 
are considered singly. It is only when components are taken in conjunc- 
tion that it becomes necessary to decide whether we wish to eliminate or 
retain redundant variance. We may wish to quantify the similarity be- 
tween two persons by calculating the distance between them in the space 
of i components. This distance is calculated by subtracting the score of 
person A from the score of person B on each component, summing the 
squared differences over the i components, and taking the square root 
of the sum. As an example, the square of the distance between person 
two and person four in our data is 


$$(11.6 - 5.6)^2 + (-1.2 - 5.8)^2 = 85$$

and the distance between them is therefore $\sqrt{85} = 9.2$. In the matrix
of normalized component scores $C_n$, each score on the first component
is divided by $\sqrt{\lambda_1} = 20$ and each score on the second component is
divided by $\sqrt{\lambda_2} = 10$. The squared distance between the same two
persons in the circular space of $C_n$ is

$$(0.58 - 0.28)^2 + (-0.12 - 0.58)^2 = 0.58$$

giving a distance of 0.76. The absolute difference in the order of size of
these two distances is an artefact. It should be evident, however, that by 
altering the relative weights given to the components, we alter the rela- 
tive magnitudes of distances between persons. 
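
The two distances may be reproduced with a short sketch. The component scores of persons two and four and the latent roots are those quoted in the text; the function names are illustrative assumptions.

```python
import math

person_two = (11.6, -1.2)    # scores on the first and second components
person_four = (5.6, 5.8)
roots = (400.0, 100.0)

def squared_distance(a, b):
    """Sum of squared differences between two persons over all components."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(scores, roots):
    """Divide each component score by the square root of its latent root."""
    return tuple(s / math.sqrt(lam) for s, lam in zip(scores, roots))

d2_raw = squared_distance(person_two, person_four)
d2_norm = squared_distance(normalize(person_two, roots),
                           normalize(person_four, roots))
print(d2_raw, math.sqrt(d2_raw))     # about 85 and 9.2
print(d2_norm, math.sqrt(d2_norm))   # about 0.58 and 0.76
```

In the unnormalized space the first component contributes 36 of the 85 units of squared distance; after normalization it contributes only 0.09 of 0.58, so the relative importance of the two components is altered as well as the overall scale.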
We may imagine a set of tests, of which twenty are mainly tests of 
aggression, while two are mainly tests of euthymia-dysthymia. From 
the analysis there emerge two components, each of which seems to be 
strongly related to one of these characteristics. It is probable that the 
latent root of the aggression component will be several times greater 
than the latent root of the euthymia component. We may decide to 
measure differences in personality by calculating the distances between 
persons in the subspace of the two components. Let us suppose that 
person A and person B obtain the same score on the aggression com- 
ponent, but very different scores on the euthymia component. It is 
obvious that our assessment of the similarity of the personalities of A 
and B will differ according to whether we weight the two components 
by their respective standard deviations or give them equal weights. 
Although it might seem reasonable either to give equal weights to the 
two dimensions or to impose some a priori weighting, in practice re- 
search workers, in so far as they have considered the question at all, 
have preferred to weight each component by its standard deviation. The 
justification for this rough-and-ready procedure is that the research
worker has introduced an implicit weighting by his decision to include,
say, ten times as many tests of aggression as tests of euthymia. He has 
said, in effect, that he is primarily interested in aggression. He also feels 
that, having defined the aggression component by a large number of 
tests, he should be able to claim that he has averaged out much of the 
error variance of the individual tests, and so consigned it to subsequent 
components. It is right, therefore, to give a smaller weight to the later 
components, both because they are measuring dimensions of secondary 
interest, and because they may contain a greater proportion of error. 
The justification has been illustrated by a case in which each component 
may be identified as a measure of a particular characteristic. Such an 
identification is not essential to the validity of the argument. Although 
the justification may have some plausibility in the circumstances of some 
analyses, it certainly cannot be applied universally. If the choice of tests
has been dictated solely by availability, and if the tests differ widely in 
their relative validities and reliabilities, then the empirically derived 
weighting is suspect. 

If component scores have been normalized, the weighting matrix for
calculating the scores in M from the normalized scores $C_n$ is the
transpose of $F_A$.
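
As a sketch of this back-calculation, the normalized component scores of persons two and four (0.58, -0.12 and 0.28, 0.58, as used in the distance example) are postmultiplied by the transpose of $F_A$. The raw scores printed are simply whatever the arithmetic yields; they are not quoted from the score matrix itself, and the variable names are assumptions.

```python
F_A = [[12.0, 8.0],
       [16.0, -6.0]]           # rows are tests, columns are components

C_n = [[0.58, -0.12],          # person two, normalized component scores
       [0.28, 0.58]]           # person four

# M = C_n F_A': each raw score is the inner product of a person's
# normalized component scores with the row of F_A belonging to the test.
M_rows = [[sum(c[j] * F_A[t][j] for j in range(2)) for t in range(2)]
          for c in C_n]
print(M_rows)
```

The sums of squares of the recovered rows (136 and 65) agree with the sums of squares of the same persons' unnormalized component scores, $11.6^2 + 1.2^2$ and $5.6^2 + 5.8^2$, as they must when all components are retained.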


Components as variables 


It has been emphasized that simple percentages are often illuminating 
in the interpretation of a principal component analysis. This is true 
whether the percentages are percentages of the variance of components, 
tests or persons. Having decided on a weighting scheme for the
components, the research worker should calculate the variance of each
person and then calculate the percentage of the variance which is attri- 
butable to each component. The variance of a person is the square of 
his distance from the origin. Since the origin of every component is 
zero, this squared distance is the sum of squares of his component 
scores. 
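
A minimal sketch of the calculation for persons two and four of the worked example follows, using their unnormalized component scores. The function name `variance_shares` is hypothetical.

```python
def variance_shares(component_scores):
    """Return a person's variance and the percentage of it due to each component."""
    total = sum(s * s for s in component_scores)
    return total, [100.0 * s * s / total for s in component_scores]

total_two, shares_two = variance_shares((11.6, -1.2))
total_four, shares_four = variance_shares((5.6, 5.8))
print(round(total_two, 2), [round(p, 1) for p in shares_two])
print(round(total_four, 2), [round(p, 1) for p in shares_four])
```

Person two lies almost wholly along the first component (about 99 per cent of his variance), whereas the variance of person four is divided nearly equally between the two components, so that neglecting the second component would misrepresent him badly.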

If a satisfactory percentage of the total variance has been accounted 
for by the first few components, it is a common practice to neglect the 
remainder. It is often the case, however, that the bulk of the variance of 
a few persons is attributable to the neglected components. This may be 
a sign that some of the smaller dimensions, although they do not dis- 
criminate among the majority of the population, are particularly rele- 
vant to certain subclasses of people. It might be, for example, that a 
few persons of paranoid personality in a large sample of the normal 
population would display their abnormality on a component associated 
with one of the smaller latent roots. 

In the past the common practice has been to interpret a component 
by reference to the pattern of saturations in a column of matrix F. With
the coming of the computer, the research worker has other aids to
interpretation at his disposal. He may, for example, check the naming 
of a component by looking at the component scores of persons who 
possess the characteristic which the component is supposed to measure. 
If he thinks that a component is a measure of extraversion-introversion, 
he should ask whether persons who seemed to him to be extraverted or 
introverted lie at appropriate ends of the dimension, while the rest of 
the sample lies somewhere in between. A more systematic check may 
be made by allotting each person to one of two or more groups on the 
basis of direct observation, and then testing the fit between the psycho- 
logical grouping and scores on the component. The test may be made 
by means of analysis of variance. The average score of the whole sample 
on any component is zero. If the component scores have been calculated 
as in our example, the total sum of squares of a component is equal to 
its associated latent root. If the component scores have been calculated 
by applying the normalized latent vectors of the correlation matrix to 
standardized scores, the total sum of squares of a component is equal 
to n times the associated latent root of R, where n is the total number 
of persons. The between sum of squares may be calculated by squaring
the mean component score of each group, multiplying by the number
in the group, and summing over all groups. The within sum of squares
may be obtained by subtraction.
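
The partition can be sketched as follows. The component scores and the allotment of persons to three groups are hypothetical, chosen only so that the scores have a mean of zero; in an actual analysis the total sum of squares would be the latent root (or n times the latent root of R) as described above.

```python
# Hypothetical component scores, grouped on the basis of direct observation.
groups = {
    "A": [4.0, 2.0],
    "B": [-1.0, 1.0],
    "C": [-3.0, -3.0],
}

# Total sum of squares about the overall mean, which is zero for a component.
total_ss = sum(x * x for scores in groups.values() for x in scores)

# Between: square each group mean, multiply by the group size, and sum.
between_ss = sum(len(scores) * (sum(scores) / len(scores)) ** 2
                 for scores in groups.values())

# Within: obtained by subtraction.
within_ss = total_ss - between_ss
print(total_ss, between_ss, within_ss)
```

The shortcut for the between sum of squares depends on the overall mean being zero; with scores measured from any other origin the group means would first have to be expressed as deviations from the overall mean.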


The strength of the association between the grouping and the 
component may be expressed as an intraclass correlation coefficient 
(Snedecor, 1956), or as a correlation ratio. But analysis of variance is 
more suited to establishing that a component measures a characteristic 
than it is to estimating how well the characteristic is measured. Unless
a very fine grouping has been employed, the correlation coefficient may
be considered only as setting a lower bound to the validity of the com- 
ponent. In practice, if an independent and uncontaminated measure of a 
characteristic is available, it would be as well to treat it as a criterion 
variable in a multiple regression analysis. The use of only two validation 
groups is not advisable except as a preliminary step, since a component, 
like any other variable, must be shown to be measuring the charac- 
teristic in question in every section of its length. 

For exploratory purposes, the more complicated designs in the 
analysis of variance may prove useful. The calculations remain simple 
so long as allotment of members to cells of the design is both exclusive 
and exhaustive. Few research workers have attempted a systematic 
study of the empirical relations between principal components and the 
main and interaction terms of a factorial design in the analysis of vari- 
ance. In a study of hostility (Hope, 1963) a clear association was found 
between a first principal component and a second-order interaction 
term in the analysis of variance. This finding indicates that a person’s 
position on the component is the outcome of the non-additive effects of
several factors, in the analysis of variance sense of ‘factor’. The second
component of the same analysis was related to each of the three main 
terms, but not to the interaction terms; this suggests that the component 
is a measure of three factors which combine together additively. 

A component analysis provides a static map of the positions of per- 
sons in the factorial space. A dynamic interpretation would treat com- 
ponents as lines of thrust which combine additively, as in the parallelo- 
gram of forces. The position of a person would be regarded as a 
resultant of various pushes and pulls, each parallel to a different prin- 
cipal axis, just as the movement of a boat may be the resultant of a 
westerly wind and a southerly tide. The order in which the component 
thrusts are applied is irrelevant to the final position of the boat, or the 
person. Indeed, all the thrusts may be applied simultaneously. Attractive 
as the dynamic interpretation may be, it cannot usefully be applied 
unless it is possible to watch the persons taking up their positions on 
the components, as one can watch the moves in a game of chess. A 
principal component analysis is like a still aerial photograph of a foot- 
ball field during a game, with each player reduced to a speck, giving no 
hint of whether or whither he is moving. One cannot even be sure that 
the configuration of the players will remain constant, and so one cannot 
assume that a particular player will continue to occupy a particular 
position. If a set of components has been shown to exist in several 
different samples, we may assume the stability of the configuration and 
turn to the study of the movements of individual persons over time. 
The components will be valuable as axes of reference. It does not
follow that they will be useful as explanatory devices. Only if the
empirically-derived components conform to dimensions hypothesized
by the research worker will he feel justified in employing them as
theoretical constructs.


Sign reversal 

Because a latent vector is, in geometrical terms, simply a straight line, 
the pattern of the signs of its elements is arbitrary, in the sense that all 
these signs may be reversed without affecting any of the mathematical 
properties of the vector or of the matrix of vectors F. If one sign is 
reversed, then all must be reversed. Reversing the sign of a vector is 
equivalent to looking at the component from its opposite end: as dullness,
for example, rather than intelligence, or as extraversion rather than
introversion. Since a test is also represented geometrically as a straight
line, it may undergo a similar transformation by reversal of the signs of
all its elements in matrix F. Thus all the signs of any row or any column 
of F may be reversed without affecting any of its algebraic properties. 
The psychologist’s traditional distinction between a unipolar component 
(a column of F containing only positive elements or zeros) and a bipolar
component (a column of F containing both positive and negative
elements) is a psychological, rather than a mathematical distinction. By
a reversal of the signs of the appropriate tests a unipolar component can
be turned into a bipolar component. Similarly, a bipolar component
can be turned into a unipolar component. This is true of the weighting
vectors $f_i$. It is not true of the components $c_i$, each of which must
contain both positive and negative elements. The mean of each $c_i$, as of
each test, is zero.


Constraints 


In principal component analysis we usually obtain as many non-zero 
latent roots and vectors as there are tests. If the number of persons n is
equal to or less than the number of tests, we obtain a maximum of
n - 1 components.

Factor analysts commonly object to an analysis which is derived from 
a correlation matrix with unities in the principal diagonal. Their objec- 
tions may be understood if we consider the trivial case of a test which 
has been twice administered to a sample of persons. The matrix 
$M_n' M_n = R$ is then a 2 x 2 correlation matrix with unities in the
diagonal and the self-correlation of the test in the off-diagonal cells. 
Unless the test is perfectly reliable, a principal component analysis of 
R will yield two components. The first gives equal positive weights to 
the two administrations, and the second gives a positive weight to one 
administration and an equal negative weight to the other administration. 
Clearly, the second component is a nonsense component. It emerges 
only because the test is less than perfectly reliable. 
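
The two components of the test-retest matrix can be written down directly: for a 2 x 2 correlation matrix with off-diagonal element r, the latent roots are 1 + r and 1 - r. The sketch below confirms this for a hypothetical self-correlation of 0.8; the function names are illustrative assumptions.

```python
import math

def retest_components(r):
    """Latent roots and unit vectors of [[1, r], [r, 1]], larger root first (r >= 0)."""
    s = 1.0 / math.sqrt(2.0)
    return [(1.0 + r, (s, s)), (1.0 - r, (s, -s))]

def residual(R, lam, f):
    """Return R f - lam f; zeros when (lam, f) is a latent pair of R."""
    return [sum(R[i][j] * f[j] for j in range(2)) - lam * f[i] for i in range(2)]

r = 0.8                        # hypothetical self-correlation of the test
R = [[1.0, r], [r, 1.0]]
pairs = retest_components(r)
checks = [residual(R, lam, f) for lam, f in pairs]
print(pairs, checks)
```

The second latent root, 1 - r, is simply the unreliability of the test, and it vanishes only when r is unity.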

In order to eliminate the second component, the unities in the
diagonal of R must be replaced by self-correlations. R is then a unit
hierarchy containing only one component. Error variance, in the sense
of that part of the variance of tests which is not replicable, may be
eliminated from a matrix by subtracting the error variance of each test
from the diagonal entry for that test, which represents its total variance.
In the case of a correlation matrix, this is equivalent to replacing unity
by the reliability of the test.
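
The argument about the twice-administered test can be checked with a few lines of arithmetic. The following is only a sketch: the self-correlation of 0.8 is an assumed value, and `roots_2x2` is a hypothetical helper which relies on the fact that a symmetric 2 × 2 matrix with equal diagonal entries d and off-diagonal entry r has latent roots d + r and d − r.

```python
def roots_2x2(diagonal, off_diagonal):
    """Latent roots of [[d, r], [r, d]].

    The root d + r belongs to the vector (1, 1), the root d - r to (1, -1).
    """
    return diagonal + off_diagonal, diagonal - off_diagonal


reliability = 0.8  # assumed self-correlation of the test

# Unities in the diagonal: two non-zero roots, hence two components.
with_unities = roots_2x2(1.0, reliability)

# Reliabilities in the diagonal: the second root vanishes.
with_reliabilities = roots_2x2(reliability, reliability)

print("unities in diagonal:      ", with_unities)
print("reliabilities in diagonal:", with_reliabilities)
```

With unities in the diagonal the second root is small but not zero; with the reliability in the diagonal it is exactly zero, leaving a single component.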

Factor analysts are not content with the elimination of unreliability. 
They point out that part of the reliable or true variance of a test may be 
specific to that test, in the sense that it is not correlated with any part 
of the variance of any of the other tests which have been included in the
matrix. One solution to the problem of estimating the specific variance
of a test involves the calculation of the regression of that test on all the
other tests in the analysis (chapter 6). The diagonal entry for the test is
the square of the multiple correlation coefficient. The reasoning behind
this method is that the analysis should include only that variance which
a test shares with some or all of the other tests, and the square of the
multiple correlation coefficient is the maximum proportion of the
variance of one variable which is predictable from any linear combination
of the other variables. The geometry of the situation may clarify the 
argument. The tests are represented as a sheaf of vectors, in the usual 
manner. If a test lies entirely within the space of some or all of the other 
variables, then it has no specific variance. In so far as it lies at an angle 
to the space of the other variables, then it possesses specificity. The 
multiple correlation coefficient is the cosine of this angle. 
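
A small numerical illustration may help. The sketch below assumes three tests with hypothetical intercorrelations (0.6, 0.5 and 0.4) and uses the standard formula for the squared multiple correlation of one variable on two others; the function name is illustrative only.

```python
import math


def squared_multiple_correlation(r12, r13, r23):
    """R squared of test 1 regressed on tests 2 and 3."""
    return (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)


# Hypothetical correlations among three tests.
r_squared = squared_multiple_correlation(r12=0.6, r13=0.5, r23=0.4)

# The multiple correlation is the cosine of the angle between test 1
# and the plane containing tests 2 and 3.
angle = math.degrees(math.acos(math.sqrt(r_squared)))

print(f"diagonal entry (R squared): {r_squared:.4f}")
print(f"angle to the space of the other tests: {angle:.1f} degrees")
```

On these assumed figures a little under half of the variance of the test is shared, and the test stands at an angle of rather less than 50° to the space of the other two.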

An alternative and rather roundabout approach involves reducing 
the rank of the correlation matrix. The rank of a matrix is equal to the 
number of its non-zero latent roots. The order of a (square) matrix is 
its number of rows and columns. It is observed that the elimination of 
error and specific variance reduces the rank of a matrix. It is then argued 
that reducing the rank of the matrix should remove its error and specific 
variance. The argument gains some plausibility when it is realized that, 
for each order of matrix, there is a minimum to the rank which can be 
attained by reduction of the diagonal elements (Ledermann, 1937). This 
minimum is a minimum in a rather special sense. The minimum rank of 
a matrix of order k is the rank which may be achieved, by suitable 
adjustment of the principal diagonal, in every matrix of order k, what- 
ever the actual numerical values in the off-diagonal cells. If these cor- 
relation coefficients are in fact subject to some empirical constraints, the 
attainable rank of their matrix may be less than the minimum. 

The problem of what values to insert in the principal diagonal of a
correlation matrix is known as the communality problem; that is, it is
concerned with estimating the proportion of the variance of a test which
is common or shared. The problem is discussed in all textbooks of factor
analysis (e.g. Thomson, 1951). In practice the larger the order of the
correlation matrix the less does it matter what values appear in the
principal diagonal. In a 2 × 2 matrix the number of diagonal elements
is 50 per cent of the total. In a 10 × 10 matrix it is 10 per cent.


Dimensionality


In chapter 1 determinants were introduced by relating them to the
solution of linear equations. The nature of a determinant is considerably
illuminated when we realize that it can be calculated as the product of
all the latent roots of a matrix. Thus the determinant of the matrix W
in this chapter is

\[ |W| = 400 \times 100 = 40{,}000 \]
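
The identity between the determinant and the product of the latent roots is easily verified. The matrix used below is a hypothetical 2 × 2 example chosen only because its roots are also 400 and 100; it is not claimed to be the matrix W of the text, and the helper names are illustrative.

```python
import math


def det_2x2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def roots_symmetric_2x2(m):
    """Latent roots of a symmetric 2 x 2 matrix from its trace and determinant."""
    trace = m[0][0] + m[1][1]
    disc = math.sqrt(trace ** 2 - 4 * det_2x2(m))
    return (trace + disc) / 2, (trace - disc) / 2


example = [[250.0, 150.0], [150.0, 250.0]]  # hypothetical matrix

roots = roots_symmetric_2x2(example)
print("latent roots:    ", roots)
print("product of roots:", roots[0] * roots[1])
print("determinant:     ", det_2x2(example))
```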


If we confine our discussion to the correlation matrix, we can envisage
two extreme cases. The first is a correlation matrix which is a unit
matrix; that is, no test correlates with any other test. The latent roots
of such a matrix will all be unity and the product of all the roots will
also be unity. The research worker may be tempted to despise this type
of correlation matrix because it contains ‘non-significant’ correlations.
However, so long as the correlations are accurate estimates of the true


… instability, arbitrariness … which can be seriously misleading.
This becomes evident … One of the … may take each test in turn, …
principal components of the re… appear different from analysis …
components (one set for each pair of … two-dimensional space, namely, …


Most discussions of the ‘significance’ of factors or components have
been concerned with dimensionality, that is, with the question of the
number of dimensions which exist in the population from which the
sample of persons is drawn. In a sense this concern with the number of
dimensions is misplaced. A population must include its samples, and 
any dimension of a sample must also exist in the population. There are 
therefore no grounds for questioning the existence of any component 
however small its latent root. A question which might be asked is 
whether a dimension of the sample represents a principal axis of the 
population. One way to establish that it does is to calculate the principal 
components of two random samples of persons drawn from the same 
population. The degree of resemblance between a component of sample 
X and a component of sample Y may be assessed by multiplying the 
normalized latent vectors (columns of Fc) and treating the product x’y 
as the cosine of the angle between the two components x and y. It has 
been shown that a set of tests may yield very high cosines even for 
vectors associated with the smallest latent roots (Hope, 1963; Slater, 
1964). 
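
The comparison of components across samples amounts to one inner product per pair of vectors. The sketch below uses two hypothetical latent vectors, supposed to come from samples X and Y; the figures are invented for illustration. Because the sign of a latent vector is arbitrary, the absolute value of the cosine is taken before it is converted to an angle.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine(x, y):
    """Inner product x'y of the normalized vectors."""
    return sum(a * b for a, b in zip(normalize(x), normalize(y)))


vector_x = [0.60, 0.55, 0.58]  # hypothetical latent vector, sample X
vector_y = [0.62, 0.50, 0.60]  # hypothetical latent vector, sample Y

cos_xy = cosine(vector_x, vector_y)
angle = math.degrees(math.acos(min(1.0, abs(cos_xy))))

print(f"cosine = {cos_xy:.4f}, angle = {angle:.1f} degrees")
```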

When a factor analyst seeks to reduce the number of dimensions of 
his sample, he is in principle impugning the tests which he has employed. 
He is implicitly claiming that part of the variance of some or all of his 
tests is irrelevant, for one reason or another, to the subject-matter of 
his study. The question of relevance cannot be decided by mathematical 
considerations. Minimizing the rank of a matrix by the introduction of 
communalities into the principal diagonal cannot assure us either that 
we have omitted all irrelevant dimensions or that we have retained all 
relevant dimensions. Similarly the percentage of variance accounted for 
by a set of latent roots is not, in itself, a sufficient criterion of relevance. 
The only adequate safeguard against misleading and uninterpretable
analyses is the employment of tests which are known to have a close
association with the characteristics which they are designed to measure.
Tests of suitably high validity are, unfortunately, rare in the social
sciences.

Bartlett’s (1950; 1951a; 1951b) test of significance is concerned, not
with dimensionality, but with sphericity. Let us suppose that we have
analysed a 3 × 3 matrix, obtained from a sample of n = 100 persons,
and found the following latent roots:

\[ \lambda_1 = 8, \qquad \lambda_2 = 4, \qquad \lambda_3 = 2 \]


We test the null hypothesis that these roots are all of constant size by
calculating the ratio of the geometrical mean of the roots to their
arithmetic mean,

\[ r = \frac{(8 \times 4 \times 2)^{1/3}}{\tfrac{1}{3}(8 + 4 + 2)} = 0.8571 \]

We raise r to the power of the number of roots, $r^3 = 0.6295$, and
calculate

\[
\begin{aligned}
\chi^2 &= -(n - t - \tfrac{1}{2}) \log_e r^3 \\
       &= -(100 - 3 - \tfrac{1}{2})(-0.4628) \\
       &= 44.66
\end{aligned}
\]

We let k be the number of roots which have been omitted from the
analysis. Since all the roots are contributing to this chi square, k = 0.
The degrees of freedom are

\[
\begin{aligned}
\text{d.f.} &= \tfrac{1}{2}(t - k + 2)(t - k - 1) \\
            &= \tfrac{1}{2}(3 - 0 + 2)(3 - 0 - 1) \\
            &= 5
\end{aligned}
\]

Since the chi square is significant, we have established that the latent
roots cannot all be the same size; i.e. that the component space cannot
be regarded as a sphere.


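Now we wish to test whether the heterogeneity which exists some-
where among the roots is to be found, at least in part, between the first
and subsequent roots. In order to do this, we delete the first root, set
k to 1, and calculate a ratio for the second and subsequent roots:

\[ r = \frac{(4 \times 2)^{1/2}}{\tfrac{1}{2}(4 + 2)} = 0.9428 \]

\[
\begin{aligned}
\chi^2 &= -(n - t - \tfrac{1}{2}) \log_e r^2 \\
       &= -(100 - 3 - \tfrac{1}{2})(-0.11779) \\
       &= 11.37
\end{aligned}
\]

\[
\begin{aligned}
\text{d.f.} &= \tfrac{1}{2}(t - k + 2)(t - k - 1) \\
            &= \tfrac{1}{2}(3 - 1 + 2)(3 - 1 - 1) \\
            &= 2
\end{aligned}
\]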


Subtracting the second chi square from the first, and similarly subtract-
ing the degrees of freedom, gives us

\[ \chi^2 = 33.29, \qquad \text{d.f.} = 3 \]

The subtracted chi square is significant, and so we conclude that the gap
between $\lambda_1$ and $\lambda_2$ contributes to the heterogeneity of the roots.
$\chi^2 = 11.37$ is a test of whether there is evidence of heterogeneity among the
second and subsequent roots. A subtracted chi square may be calculated
for each root in turn.
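
The whole sequence of calculations can be set out compactly. What follows is a minimal sketch of the arithmetic described above, not a general-purpose routine; the function name and its arguments are illustrative. Carried at full precision the chi squares come to about 44.63, 11.37 and 33.26, which agree with the hand-worked figures to within the rounding of the intermediate values.

```python
import math


def sphericity_chi_square(roots, n, t, k=0):
    """Chi square and degrees of freedom for the roots which remain
    after the first k have been omitted.

    roots: the t - k latent roots under test
    n:     number of persons
    t:     number of tests (total number of roots)
    """
    m = len(roots)
    geometric_mean = math.prod(roots) ** (1.0 / m)
    arithmetic_mean = sum(roots) / m
    r = geometric_mean / arithmetic_mean
    chi_square = -(n - t - 0.5) * math.log(r ** m)
    df = (t - k + 2) * (t - k - 1) // 2
    return chi_square, df


all_roots = [8.0, 4.0, 2.0]

chi_all, df_all = sphericity_chi_square(all_roots, n=100, t=3, k=0)
chi_rest, df_rest = sphericity_chi_square(all_roots[1:], n=100, t=3, k=1)

# Subtracted chi square for the gap between the first and later roots.
chi_gap, df_gap = chi_all - chi_rest, df_all - df_rest

print(f"all roots:        chi square = {chi_all:.2f}, d.f. = {df_all}")
print(f"second and later: chi square = {chi_rest:.2f}, d.f. = {df_rest}")
print(f"subtracted:       chi square = {chi_gap:.2f}, d.f. = {df_gap}")
```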

The ratio which is employed in the test of significance of all the roots
may be written

\[ r^t = \frac{|W|}{\left( \frac{1}{t} \sum \lambda \right)^{t}} \]

where the latent roots are summed over all the roots, and |W| is the
determinant of the sums of squares and sums of products matrix. If the
normalized score matrix $M_n$ has been analysed, then the numerator of
the ratio is the determinant of the correlation matrix. The determinant 
of a matrix may be thought of as a measure related to the volume of 
space enclosed by the ellipsoid. In effect, the test calculates the ratio of 
the actual to the maximum possible volume of the ellipsoid, given the 
available length of axes. The maximum possible volume occurs when 
the axes are all equal and the ellipsoid has become a sphere. In this case 
the geometrical mean of the axes is equal to their arithmetic mean, the 
ratio is unity, and chi square is zero. 
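
The behaviour of the ratio at its limit is easily confirmed. In this sketch the determinant is taken as the product of the latent roots, and `volume_ratio` is an illustrative name; the second set of roots is a hypothetical sphere with all axes equal.

```python
import math


def volume_ratio(roots):
    """Product of the roots (the determinant) divided by the t-th power
    of their arithmetic mean."""
    t = len(roots)
    return math.prod(roots) / (sum(roots) / t) ** t


ellipsoid = volume_ratio([8.0, 4.0, 2.0])  # unequal axes
sphere = volume_ratio([5.0, 5.0, 5.0])     # equal axes (hypothetical)

print(f"unequal roots: ratio = {ellipsoid:.4f}")
print(f"equal roots:   ratio = {sphere:.4f}, log = {math.log(sphere):.4f}")
```

The ratio can never exceed unity, and its logarithm, and therefore chi square, is zero only when every root is the same size.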

It is sometimes argued that, if two roots are so close together that 
they do not yield a significant chi square, one is not justified in inter- 
preting the associated latent vectors. The force of the argument lies in 
the fact that, if a plane of the space is circular (like the waist-section of 
a rugby ball), the two latent vectors which are diameters of the circle 
may take any direction in the plane, so long as they remain at right- 
angles to one another. On the other hand, it is possible to find two 
neighbouring roots which are so similar in size that the kth root of
sample X is the (k + 1)th root of sample Y, and the kth root of sample
Y is the (k + 1)th root of sample X. The equivalence of the roots is
demonstrated by calculating the cosines between their corresponding 
vectors. When such a situation occurs, it is possible to ‘accuse’ the 
sampling and to say that one or both samples is biased. Nevertheless, 
the occasional appearance of such anomalous cases is a salutary warning 
against putting too much emphasis on questions of sphericity. 

The overall chi square test (Mauchly, 1940) of the whole set of roots
(in our example, $\chi^2 = 44.66$) has different interpretations according to
whether it is applied to the correlation matrix or to the dispersion matrix
(the test of the sums of squares and sums of products matrix W is
equivalent to a test of the dispersion matrix $V = \frac{1}{n} W$). Applied to
the correlation matrix, the test asks whether the roots show any ten-
dency to deviate from a uniform value of unity. If they do not, then the
correlations among the tests cannot be assumed to differ from zero, and
there is no point in transforming them to uncorrelated components. If
the test of significance has been applied to W or to the dispersion matrix,
it is a test of the independence of the variables and also a test of the
equality of their variances.

The chi square test of the (k + 1)th and subsequent roots may be
thought of as an assay of the independence, or independence and 
equality of variances, in the residual matrix, which remains after the 
removal from R or W of the unit hierarchies accounted for by the first 
k components. The hypothesis which it tests is very like that tested by 
the older factor analysts, who made rule-of-thumb estimates of the 
significance of residual correlation matrices before deciding whether or 
not to extract further factors. 

Lawley (1956; Lawley and Maxwell, 1963) has indicated how the
multiplier for L, which was given above as $-(n - t - \tfrac{1}{2})$, may be
modified in order to improve the chi square approximation.


Summary 


The reader has been presented with a succession of arithmetic and
algebraic relations, which may have left him bewildered. Let us therefore
summarize the salient properties of the method of principal components.

(1) The components, in the sense of columns of component scores $c_i$,
are at right-angles to one another.

(2) The sum of squares of scores on the first component is a maxi-
mum, and hence the sum of squared deviations from that com-
ponent is a minimum. The sum of squares is a latent root, or a
latent root multiplied by the number of persons in the sample.

(3) When person-points are projected on to the hyperplane which is
perpendicular to the first component, the second latent root and
latent vector have the same properties in this hyperplane as had
the first root and vector in the whole space; and so on for each
latent root and latent vector in turn.

(4) For some purposes it is convenient to think of the tests as axes
at right-angles to one another. If the distribution is multivariate
normal, the person-points should then lie in an ellipsoidal swarm.
The components are axes of the ellipsoidal surface. The square
root of a latent root is the distance from the centre to the surface
along the line of the appropriate axis. The product of the latent
roots is related to the volume of the ellipsoid. Persons may be
represented as points spreading out from the centre of the ellip-
soidal figure. The figure may be squashed into a sphere by equat-
ing the lengths of the axes.

(5) For other purposes it is convenient to think of the tests and
components as a bunch of lines crossing at a common origin. The
cosine of the angle between two lines is the correlation between
the tests or components represented by the lines.


Chapter 5 Spherical Maps


The interpretation of many multivariate analyses may be facilitated by 
mapping vectors in the whole space of the analysis, or in some sub- 
space. The general principles of mapping may be illustrated by describ- 
ing the use to which Dr Patrick Slater has put it in representing the 
results of a principal component analysis of a repertory grid. The con- 
struction of maps is sometimes referred to as polar co-ordinate analysis. 

In a typical repertory grid experiment, a single person is asked to
formulate a number of constructs which inform his attitudes towards
other people. He is then asked to rate persons known to him on these
constructs. Any kind of rating technique may be employed, so long as
the responses can be expressed numerically. The persons rated are
known as the elements of the grid. A construct is the equivalent of a
test in chapter 4.


Let us suppose that we are investigating a woman’s attitudes to male
members of her family, and that we invite her to suggest the sort of
terms she uses to describe them. She suggests four adjectives: strong,
sympathetic, aggressive and stingy. Taking each construct in turn, we
ask the woman to give a rating on a five-point scale to each man on that
construct. The ratings are reported in the following matrix. A numeri-
cally high value means that the woman attributes the named end of the
construct to the element.


             Husband  Uncle  Brother  Father  Son  Mean
Strong          4       3       1       2      5     3
Sympathetic     4       3       3       1      4     3
Aggressive      4       2       1       2      1     2
Stingy          1       1       0       2      1     1


We now ‘centre the matrix for constructs’; that is, we express each
rating as a deviation from the mean for that construct. In the terms of
the previous chapter, this gives us matrix M’.


             Husband  Uncle  Brother  Father  Son
Strong          1       0      -2      -1      2
Sympathetic     1       0       0      -2      1
Aggressive      2       0      -1       0     -1
Stingy          0       0      -1       1      0


The sums of squares and sums of products matrix for constructs is
M’M = W.


    10    5    2    1
          6    1   -2
               6    1
                    2
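
The centring and the formation of W can be reproduced directly from the ratings. The sketch below is an illustration only; the variable names are not part of the original notation. Each construct is held as a row of five ratings, and every cell of W is the sum of products of two centred rows.

```python
ratings = {
    #               Husband Uncle Brother Father Son
    "Strong":      [4, 3, 1, 2, 5],
    "Sympathetic": [4, 3, 3, 1, 4],
    "Aggressive":  [4, 2, 1, 2, 1],
    "Stingy":      [1, 1, 0, 2, 1],
}


def centre(row):
    """Express each rating as a deviation from the mean of its construct."""
    mean = sum(row) / len(row)
    return [x - mean for x in row]


centred = {name: centre(row) for name, row in ratings.items()}
names = list(centred)

# Sums of squares (diagonal) and sums of products (off-diagonal).
w = [
    [sum(a * b for a, b in zip(centred[p], centred[q])) for q in names]
    for p in names
]

for name, row in zip(names, w):
    print(f"{name:12s}", row)
```

The diagonal of the result (10, 6, 6, 2) shows at a glance how unequally the woman spreads her ratings over the four constructs.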


It is noteworthy that no attempt was made to constrain the woman
to give the same variance on every construct. It seems that she is par-
ticularly discriminating in her attribution of strength or weakness to her
menfolk. There is no reason to distort this finding by analysing the
correlation matrix, in which all constructs have the same variance, and
so we extract the components for the matrix containing the sums of
squares and sums of products. The weighting vectors which, when ap-
plied to the ratings centred for constructs, yield the components, are
reported here in non-normalized form. The sum of squares of each
vector is set equal to its associated latent root, so that the sum of squares
of a row is equal to the sum of squares of the corresponding test. The
matrix is F4 of the previous chapter.


               f1    f2    f3    SS
Strong          3     0     1    10
Sympathetic     2    -1    -1     6
Aggressive      1     2    -1     6
Stingy          0     1     1     2

Latent root    14     6     4
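
The properties claimed for the weighting vectors can be confirmed by multiplication. The sketch below takes the weights to be 3, 0, 1 for Strong; 2, −1, −1 for Sympathetic; 1, 2, −1 for Aggressive; and 0, 1, 1 for Stingy. On that reading the columns are mutually orthogonal, their sums of squares are the latent roots, and the products of the rows reproduce W. The variable names are illustrative only.

```python
weights = {
    #               f1  f2  f3
    "Strong":      [3,  0,  1],
    "Sympathetic": [2, -1, -1],
    "Aggressive":  [1,  2, -1],
    "Stingy":      [0,  1,  1],
}
w = [[10, 5, 2, 1], [5, 6, 1, -2], [2, 1, 6, 1], [1, -2, 1, 2]]

names = list(weights)
columns = [[weights[name][j] for name in names] for j in range(3)]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


# Column sums of squares should equal the latent roots 14, 6 and 4.
column_ss = [dot(col, col) for col in columns]

# Different columns should have zero sums of products.
cross_products = [dot(columns[0], columns[1]),
                  dot(columns[0], columns[2]),
                  dot(columns[1], columns[2])]

# Products of the rows should reproduce every cell of W.
reproduced = [[dot(weights[p], weights[q]) for q in names] for p in names]

print("column sums of squares:", column_ss)
print("cross-products:        ", cross_products)
print("rows reproduce W:      ", reproduced == w)
```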


The fourth root has disappeared. If we normalize the vectors and apply
them to the ratings expressed as deviations, we obtain the orthogonal
components (matrix C):


                c1       c2       c3
Husband        1.87     1.22    -1.00
Uncle          0        0        0
Brother       -1.87    -1.22    -1.00
Father        -1.87     1.22     1.00
Son            1.87    -1.22     1.00

Latent root   14.00     6.00     4.00


The husband appears as a strong character whose aggressiveness is
mitigated somewhat by his sympathy. The son is also seen as a strong
character; he is sympathetic but not aggressive. With the help of the 
woman’s introspections, we might interpret the first component as a 
dimension of strength of character, giving rather more weight to sym- 
pathy than to aggressiveness. The second component contrasts pacific 
persons, who tend to be sympathetic (son) or generous (brother), with 
aggressive persons (husband) or stingy and unsympathetic persons 
(father). The third component suggests a dimension of strength which 
is related to stinginess, but opposed to both aggression and sympathy. 
It happens that every person except the uncle has the same proportion 
of the variance. The uncle has none at all (this accounts for the zero 
root). 
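
The component scores, and the share of the variance belonging to each man, follow from the centred ratings and the normalized weighting vectors. This is a sketch under the assumption that the weights are those used earlier (3, 0, 1; 2, −1, −1; 1, 2, −1; 0, 1, 1) with latent roots 14, 6 and 4; the names are illustrative.

```python
import math

centred = {
    #            Strong Sympathetic Aggressive Stingy
    "Husband": [1, 1, 2, 0],
    "Uncle":   [0, 0, 0, 0],
    "Brother": [-2, 0, -1, -1],
    "Father":  [-1, -2, 0, 1],
    "Son":     [2, 1, -1, 0],
}
weights = [[3, 0, 1], [2, -1, -1], [1, 2, -1], [0, 1, 1]]  # rows are constructs
roots = [14, 6, 4]


def component_scores(person):
    """Apply each weighting vector, normalized to unit length, to one person."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(person)) / math.sqrt(roots[j])
        for j in range(3)
    ]


scores = {name: component_scores(row) for name, row in centred.items()}

# Sum of squares of each component over persons (should equal the roots).
component_ss = [sum(s[j] ** 2 for s in scores.values()) for j in range(3)]

# Sum of squares of each person over components.
person_ss = {name: sum(c * c for c in s) for name, s in scores.items()}

for name, s in scores.items():
    print(f"{name:8s}", [round(c, 2) for c in s], round(person_ss[name], 2))
print("component sums of squares:", [round(v, 2) for v in component_ss])
```

Each of the four men other than the uncle carries a sum of squares of 6, a quarter of the total of 24, and the uncle carries none.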

In order to represent the constructs (i.e. tests or variables) graphically, 
we picture them as vectors springing from the centre of a sphere such as 
a terrestrial globe. To accomplish this, we must first of all imagine the 
globe cut in half along the line of the equator, as one would cut a 
grapefruit. We choose a point on the equatorial circle (that is, on the 
upper rim of the rind of the grapefruit) as the place where our Green- 
wich line is to meet the equator. This is the point of 0° longitude and 
0° latitude. We draw the diameter of the circle which meets the surface 
at this point, and we call it the first component. We draw the diameter 
at right-angles to the first component, and call it the second component. 
The positive end of the second component meets the equator at 90° 
longitude. The position of the equatorial section (‘section’ in the sense 
of cut), and the positions of the component diameters are quite arbi- 
trary. Unlike a grapefruit or the globe of the earth, both of which have 
obvious polar points, our globe is an undifferentiated sphere, and we 
may choose any one of an infinity of possible equators. Similarly, we 
may choose any point on the equator to be 0° longitude. Having made 
these two choices, the position of the second component is not a matter 
for choice. It must be a diameter of the equatorial circle, and it must be 
at right-angles to the first component-diameter. The only decision to be 
made about the second component is which end of the diameter shall 
represent the positive end of the component. Convention suggests that 
the positive end should be longitude 90° east. 

The reader who does not possess a terrestrial globe may care to
consult the circular maps of the whole world which are usually to be
found at the beginning of an atlas. He must imagine the first component
springing from the centre of the earth and reaching the surface at 0°,
0° (the point where the Greenwich line meets the equator), and the
second component meeting the surface at 90° east on the equator (a
point some way west of Sumatra). Our purpose is to represent every
test or construct, and also every component, as a diameter of the earth.
All diameters cross at a common origin, and the origin is at the centre
of the earth. A person who scores nought on every centred test, and
hence on every component, lies at the centre of the earth, because he is
at the mid-point of every diameter. The Uncle in the repertory grid lies
exactly at the origin.

Now we are in a position to plot the angle between each test and the 
first component within a plane of our total space. The plane is the 
equatorial plane whose axes are the first two components. We know 
that our test vectors will not lie wholly within this plane, because none 
of them is entirely accounted for by the first two components. We can, 
however, imagine a test projecting out from the centre of the grapefruit- 
half like the gnomon of a sundial, and, with the sun directly overhead, 
casting a shadow on the equatorial plane. Our first task is to calculate 
the angle between the shadow, which is the projection of the test on the 
plane of the first two components, and the second component. The 
cosine of the angle, in the equatorial plane, between Strong and the 
second component is 

\[ \frac{0}{\sqrt{3^2 + 0^2}} = 0 \]


which corresponds to an angle of 90°. We know that the sum of the 
angles between a test projection and its rectangular axes must be a 
right-angle and so, by subtracting the angle between a test projection 
and the second component from the value of a right-angle, we obtain 
the angle between the test projection and the first component. In our 
example the angle between Strong and the first component is 90° — 90° 
= 0°. The cosine of the angle between Sympathetic and the second 
component, within the plane of the first two components, is 


\[ \frac{-1}{\sqrt{2^2 + (-1)^2}} = -0.4472 \]

which, disregarding the sign, corresponds to an angle of 63°. Thus the
angle between Sympathetic and the first component is 90° − 63° = 27°.
The complete set of angles, which are angles of longitude, is


Strong           0°
Sympathetic    -27°
Aggressive      63°
Stingy          90°


It will be observed that Sympathetic has a negative angle because it is
associated with the negative end of the second component. We decided,
arbitrarily, to nominate 90° east as the positive pole of the component.
We therefore say that Sympathetic lies 27° west of the Greenwich line,
Strong lies on the line, and Aggressive and Stingy lie to the east.
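
The longitudes can be computed in one pass. The sketch below follows the indirect route described above: the angle with the second component is found from its cosine and is then subtracted from a right-angle. Keeping the sign of the cosine makes west come out negative without further adjustment. The weights are assumed to be those of the first two weighting vectors (3, 0; 2, −1; 1, 2; 0, 1), and the function name is illustrative.

```python
import math

weights = {
    #               f1  f2
    "Strong":      (3,  0),
    "Sympathetic": (2, -1),
    "Aggressive":  (1,  2),
    "Stingy":      (0,  1),
}


def longitude(f1, f2):
    """Degrees east (+) or west (-) of the first component.

    Assumes that f1 is not negative, as is true of these constructs.
    """
    cosine_with_second = f2 / math.hypot(f1, f2)
    return 90.0 - math.degrees(math.acos(cosine_with_second))


longitudes = {name: longitude(*w) for name, w in weights.items()}

for name, degrees in longitudes.items():
    side = "E" if degrees > 0 else "W" if degrees < 0 else ""
    print(f"{name:12s} {abs(degrees):5.1f} {side}")
```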

It may be asked why we did not calculate the cosine between each
test projection and the first component directly, instead of first calculat- 
ing the angle between a test and the second component and then sub- 
tracting this angle from 90°. The reason is that the indirect method 
simplifies the introduction of a third and subsequent dimensions. So far 
we have been examining the two-dimensional space of the equatorial 
plane. Obviously the third component, which must be at right-angles to 
the first two components, is the diameter of the globe which joins the 
North and South Poles. We have located each test along the line of the 
equator; now our problem is to find how many degrees each test lies 
above or below the equator. In other words, we must calculate the lati- 
tude of each test. We may, arbitrarily, decide that the positive end of 
the third component is the northern end, so that a positive association 
with the third component puts a test in the northern hemisphere, and a 
negative association puts it in the southern hemisphere. 

Now we are no longer working in a two-dimensional circle, but in a 
three-dimensional sphere. We have, as it were, put the top half of the 
grapefruit back on to the bottom half. A test vector may meet the 
spherical surface at any point. The angle of latitude of a test is the angle 
between the test diameter and the equatorial plane. We have already 
seen how a test projects on to this plane like a shadow cast by the sun. 
The shadow is in the plane of the equator and the north-south polar 
line is at right-angles to that plane. Hence the angle between the pro-
jection of the test and the third component is 90°. We now wish to
determine the angle between the projection-shadow and the test proper.
This may be done by calculating the angle between the test and the third
component and subtracting it from 90°. The cosine of the angle between
Strong and $f_3$ in the space of the first three components is


\[ \frac{1}{\sqrt{3^2 + 0^2 + 1^2}} = 0.3162 \]

which corresponds to an angle of 72°. The angle of latitude between
Strong and the equatorial plane is therefore 90° − 72° = 18°. Strong
has a positive association with the third component and so it lies 18°
north of the equator. The complete set of angles may be written


               $D_1 \to D_2$    $D_{1,2} \to D_3$

Strong              0°               18°N
Sympathetic        27°W              24°S
Aggressive         63°E              24°S
Stingy             90°E              45°N


The column headings are intended to indicate a step-by-step progression
through the dimensions. We begin with the first dimension $D_1$, which is
the dimension of the first component. It will be recalled from the last
chapter that the angle between two vectors in a one-dimensional space
must always be 0° or 180°. We add the second dimension $D_2$ and cal-
culate the angular deviation from $D_1$ which this introduces. Then we
add $D_3$, and calculate the angular deviation from the two-dimensional
space $D_{1,2}$ which this introduces. The calculations may be carried into
any number of dimensions. The number of possible columns of angles
is one less than the number of non-zero latent roots.
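
The latitudes are obtained in the same manner as the longitudes, one dimension further on. The following sketch finds the angle between each test and the third component from its cosine in the space of three components, and subtracts it from 90°. The weights are assumed to be 3, 0, 1; 2, −1, −1; 1, 2, −1; and 0, 1, 1, and the function name is illustrative.

```python
import math

weights = {
    #               f1  f2  f3
    "Strong":      (3,  0,  1),
    "Sympathetic": (2, -1, -1),
    "Aggressive":  (1,  2, -1),
    "Stingy":      (0,  1,  1),
}


def latitude(f1, f2, f3):
    """Degrees north (+) or south (-) of the plane of the first two components."""
    length = math.sqrt(f1 ** 2 + f2 ** 2 + f3 ** 2)
    angle_with_third = math.degrees(math.acos(f3 / length))
    return 90.0 - angle_with_third


latitudes = {name: latitude(*w) for name, w in weights.items()}

for name, degrees in latitudes.items():
    side = "N" if degrees > 0 else "S"
    print(f"{name:12s} {abs(degrees):5.1f} {side}")
```

A fourth column of angles, had there been a fourth non-zero root, would be found by the same step applied to the space of four components.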


Mapping the tests 


In the repertory grid whose principal components have been extracted
there are only three non-zero roots. This means that the tests or con-
structs can be completely mapped in the three-dimensional space of an
actual sphere. If a blackboard globe is available, the tests may be plotted
as points on its surface. If the spherical surface is to be projected on to
the plane of a sheet of paper, it is advisable to ensure that all the tests
fall into one hemisphere of the globe. In the repertory grid example the
problem does not arise. No test has a negative association with the first
component. If, therefore, the centre of the sheet of paper is taken as
0°, 0°, all the constructs will lie within a circle of 90° radius whose
centre is at this point. If a test has a negative weight in $f_1$, the signs of
all the weights of that test should be reversed. Reversal of signs brings
the test into the same hemisphere as the other tests. It must be borne in
mind that sign-reversal is equivalent to plotting the negative end of the
test on the spherical surface.

There are various ways of projecting the spherical surface of a globe
on to the plane of a sheet of paper. No one projection is ideal for all
purposes, and all projections involve some distortion. When the research
worker has decided on an appropriate projection, he must search an
atlas for a map which uses that projection, and mark his tests on a piece
of tracing paper laid over the map. He must remember to report the
projection he has used, so that the reader can project the map back on
to a three-dimensional surface. If the area of the map does not extend
90° from 0°, 0° in every direction, this too must be stated. Reversal of
the signs of a test may be indicated by putting a minus sign against its
name.

It must be borne in mind that a physical representation can be made
of only three components at a time. The cosine between two test vectors
is equal to the correlation of the corresponding tests only if that cor-
relation is entirely accounted for by the three components which are
represented in the globe. A test that appears, from the map, to belong
to a certain cluster of tests may in fact have a considerable proportion
of its variance accounted for by a fourth dimension which has neces-
sarily been omitted from the chart. For this reason it is essential to state
the proportion of the variance of each test that is accounted for by the
three components which define the map.
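
Both precautions, the reversal of signs and the statement of the proportion of variance, can be applied to each test before it is plotted. The sketch below uses two hypothetical tests with weights on four components; the figures and the function name are invented for illustration.

```python
def prepare_for_map(weights):
    """Return (weights, reversed, proportion) for one test.

    The signs are reversed if the weight on the first component is negative,
    and the proportion is the share of the test's sum of squares which is
    carried by the first three components.
    """
    reversed_sign = weights[0] < 0
    if reversed_sign:
        weights = [-w for w in weights]
    total = sum(w * w for w in weights)
    mapped = sum(w * w for w in weights[:3])
    return weights, reversed_sign, mapped / total


hypothetical_tests = {
    "Test A": [0.7, 0.3, 0.2, 0.1],
    "Test B": [-0.5, 0.4, 0.1, 0.6],
}

prepared = {name: prepare_for_map(w) for name, w in hypothetical_tests.items()}

for name, (w, flipped, proportion) in prepared.items():
    label = f"-{name}" if flipped else name
    print(f"{label:8s} weights {w}  mapped variance {proportion:.0%}")
```

In this invented case Test B would be plotted with a minus sign against its name, and the reader would be warned that only about half of its variance lies in the space of the map.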


Interpretation 


Although the three-dimensional mapping of tests has not been widely 
used, there is good reason to suppose that the calculation of angles of 
longitude and latitude should become part of the routine of principal 
component analysis. No doubt the geographical representation of angu- 
lar distances is more appropriate to some analyses than to others. It is 
probably more appropriate in analyses of the correlation matrix R than 
in analyses of the sums of squares and sums of products matrix W. In
most analyses the research worker normalizes his tests so that every
test has a length of unity. It is then reasonable to represent the test
vectors by the set of points at which they meet the surface of a sphere
of unit radius. If the research worker has chosen to analyse W, he has
usually had some special reason for avoiding normalization. He has, as
it were, built into the analysis his conviction that the tests are vectors of
differing lengths. The constructs of the repertory grid, for example, were
not normalized because it was supposed that the variance of a construct
was a measure of its discriminating power in the mind of the woman.
If the test vectors are to be allowed to have differing lengths, the
blackboard globe must be replaced by a wire model, in which each component
and each test is represented by a straight piece of wire, all the pieces 
crossing at a common origin. It is then possible to model the structure 
of the analysis faithfully by making the distance from the origin to the 
end of a wire proportional to the square root of the variance of the test 
which it represents. 

If any variance has to be omitted from a spherical map, it is desirable
that a roughly equal proportion of the variance of each test should be
lost by the omission. In an analysis of a 34 × 34 correlation matrix,
the author found that the first three roots accounted for 75 per cent of
the total variance. There was reason to suppose that a good part of the
remaining 25 per cent was unrelated to the purpose of the analysis. The
proportions of the variance of the individual tests which were accounted
for by the first three roots did not deviate greatly from their average of
75 per cent. The only contra-indication to the drawing of the map was
the fact that one or two persons related more strongly to some of the
later components than they did to the first three. This fact was reported
(Hope, 1968c).


Angles between persons


There is no objection, in principle, to the polar co-ordinate analysis of 
persons. A person may be represented by a vector joining his point to 
the origin. His angular separation from other persons may be calculated 
and plotted in a space whose rectangular axes are the components. Un- 
like a test vector, a person vector is a radius, rather than a diameter, of 
the sphere. It is not possible to represent all the persons in an analysis
on one hemisphere of a globe. A double map will therefore be necessary 
for any two-dimensional representation. In practice, a polar co-ordinate 
analysis of persons will rarely be required, because persons are rarely 
normalized. The person-points are, in fact, at varying distances from 
the origin, and in most researches it is preferable not to disguise this 
feature of the data by projecting all the points on to a common sphere. 
Nevertheless, there may, in some types of work, be some conceptual 
gain in thinking of persons in terms of their angular separation. In the 
last chapter the degree of similarity or difference between two persons 
was quantified by calculating the distance between their two points. If 
the calculation of distance is combined with the calculation of angular 
separation, it is possible to identify persons as opposites and to state the 
degree to which they are opposed. It must be understood, of course, that 
both opposition and difference are defined in the terms of the selection 
of persons, tests and components which has been made. Two persons 
who lie on opposite sides of the origin, with an angle of 180° between 
them, may be regarded as opposites, the extent of their opposition being 
estimated by the distance between them. 
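
The combination of angular separation and distance described above can be sketched in a few lines. This is only an illustration: the three sets of component scores are hypothetical and are not taken from any analysis in this book, and the function name is an assumption made for the example.

```python
import math

def angle_and_distance(p, q):
    """Return (angle in degrees, Euclidean distance) between two person points."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    # Clamp the cosine to [-1, 1] to guard against rounding error.
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_q)))
    angle = math.degrees(math.acos(cosine))
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return angle, distance

# Hypothetical scores of three persons on three components.
person_a = (1.0, 2.0, -0.5)
person_b = (-2.0, -4.0, 1.0)   # on the opposite side of the origin from person_a
person_c = (2.0, -1.0, 0.0)    # at right angles to person_a

angle_ab, dist_ab = angle_and_distance(person_a, person_b)
angle_ac, dist_ac = angle_and_distance(person_a, person_c)
print(round(angle_ab, 3), round(dist_ab, 3))
print(round(angle_ac, 3), round(dist_ac, 3))
```

Persons a and b are opposites in the sense of the text (their angle is 180°), and the distance between them states the degree of their opposition; persons a and c are merely different.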

Like most terms in everyday use, ‘opposition’ or ‘oppositeness’ can 
mean many things. Nevertheless, we are often faced with problems of 
choosing opposites. Is personality type A opposite to type B, or to type 
C? Are the three types, perhaps, separated by angles of 120°? Such a 
question might be approached by means of canonical analysis of 
discriminance (see chapter 7). Canonical methods, however, would distort 
the space of the analysis in the interests of reducing each personality 
type to an approximately circular or spherical distribution. The 
experimenter may feel that he himself wishes to state the terms within 
which the distribution of personality types is to be examined. In 
practice, this means that he must make a careful selection of tests, and 
impose a non-mathematical decision on the number, identity and weight of 
the components which are to define the space. The distance between types, 
and their angular separation, would be calculated from the mean of each 
type in the space. The problem of relating persons, and clusters of 
persons, to one another will not be treated here. It belongs to the field 
of numerical taxonomy, which is outside the scope of this book. 


Advantages 


The projection of vectors in three dimensions on to a plane surface is an 
essentially simple procedure, which will prove of considerable benefit in 
many studies. Its virtue is that all the relations among the variables in 
the three-dimensional space become immediately and simultaneously 
visible. The intermediate stage of representing the variables as points on 
the surface of an actual globe should not be omitted if the relations are 
at all complicated. 

It is sometimes stated that the principal component solution is not 
unique. Mathematically, this is untrue. Any ellipsoid which has not 
degenerated into a sphere has a unique set of principal axes. But, it is 
said, mathematical considerations cannot dictate the nature of reality. 
The position of a principal axis can be altered by simply adding or 
dropping a test. The fact that an axis has a maximized standard 
deviation is interesting, but it does not guarantee the purity or 
meaningfulness of the dimension. It may be compounded of any number of 
characteristics, just as we suppose a test to be compounded of any 
number of components. When we [...] not give first the di[...] 
breadth and height, which are, in any sample of boxes, correlated with 
one another. 

So, it is argued, we may, and indeed should, make our components 
conform to a structure which appears reasonable on non-mathematical 
grounds. The force of the argument is greatly reduced if we substitute 
for boxes or medicine bottles (two examples which have actually been 
submitted to illustrative analyses by psychologists) pebbles or waves. 
The mind, as Kant said in the Critique of Judgment (§68), has a 
particular insight into artefacts and engines which it has itself constructed. 
This is one reason why we design analogue machines. The dimensions 
which we use in measuring an artefact are the dimensions which we 
imposed upon it when we conceived and made it. It may well be that 
minds, like pebbles and waves, have a definite structure, but it does not 
follow that there must exist dimensions which have the privilege of being 
primary, pure or meaningful. The multiplicity of dimensions invented 
by novelists and psychologists tells against such a supposition. In the 
analysis of psychological material we are in a situation of conceptual 
weightlessness, complicated by the fact that we do not know how many 
dimensions we may expect to find in the data. 

The search for basic factors may involve rotation of the principal 
components. The research worker's state of disorientation is rarely 
improved by rotation. Having extracted a seemingly fixed set of reference 
axes from the cloud of data, he watches anxiously as they begin to move 
through space at the behest of some arithmetic criterion. Even when he 
himself dictates the conditions of the rotation, he is conscious that each 
decision he makes may be influenced by subjective biases. The reader 
of research reports which rely on rotation is often provided with in- 
sufficient information to satisfy him that he would approve of all the 
decisions which the writer, or the computer, has taken. It has, unfortu- 
nately, become standard practice to report rotated components without 
reporting either the unrotated components or the original correlation 
matrix. 

The representation of the set of tests as a set of points on the surface 
of a globe avoids the pitfalls of rotation, while retaining its advantages. 
Rotation, as ordinarily practised, is rotation of components relative to 
tests. In the case of rotation to correlated axes (see the following section) 
it also involves rotation of components relative to one another. 
Rotation, in the sense in which it may be applied to a polar co-ordinate 
analysis, is rotation of the whole space with respect to the observer. The 
tests and the components form a rigid system. No vector moves in 
relation to any other vector. But, by picking up the globe, turning it 
over, and spinning it on any axis, the observer may perform any rotation 
he pleases, immediately and without making calculations. He may 
project the surface on to a map from any point of view he chooses. If, for 
example, the test points fall into discrete clusters, he may imagine the 
centre point of each cluster joined to the origin as a vector, and he 
may give the vector a name. The centre of the map may then be chosen 
so as to exhibit the clusters with the greatest clarity and the least 
distortion. 

Factor analysis is sometimes represented as the rotation of test vectors 
to principal components, followed by the rotation of components to 
meaningful factors or axes. The advantage of a spherical map is that all 
three sets of vectors can be represented simultaneously in a completely 
rigid structure. The element of subjective judgment is eliminated from 
the analysis proper, but it is allowed free play at the conclusion of the 
calculations, when all the information is available in summary form. 

Although the use of spherical maps is at present limited by the 
restriction to three dimensions which applies to all physical analogues, 
the introduction of sophisticated visual output devices in computer 
laboratories will, in the future, allow the research worker to project 
more than three components simultaneously. For example, six components 
may be represented on ten three-dimensional stereoscopically-presented 
spheres. The observer may rotate any sphere at will, and the 
machine will automatically ensure that the remaining spheres move in 
sympathy. If the facilities are available, it should be possible to picture 
the vectors as lines, rather than as points on a surface. In this way the 
proportion of the variance of a test lying within any sphere may be 
indicated by the length of that test’s vector. Similar visual displays may 
be made of the vectors derived from the more complex types of analysis 
described in later chapters. 


Simple structure 


No attempt will be made here to describe the arithmetic complexities of 
Thurstone’s (1947) concept of simple structure. But a discussion of the 
method of polar co-ordinate representation of tests naturally raises the 
question of the relation between simple structure and the spherical map. 
A careful reading of Thurstone’s book on multiple-factor analysis sug- 
gests that simple structure is in origin a geometrical conception, and 
that its continuing fascination is a consequence of the neatness and 
simplicity of the geometrical figure which defines it. In discussing the 
geometrical model, we shall follow Thurstone’s example and confine 
ourselves to a three-dimensional space. 

By joining three points on the surface of a sphere, each of which is 
separated from the other two by angular distances of 90°, we obtain a 
right-angled spherical triangle. Each side of the triangle is an arc of a 
great circle of the spherical surface. A sufficient, and perhaps definitive, 
condition of the presence of orthogonal simple structure is that all tests 
should lie on or near the sides of such a triangle. The triangle may lie 
anywhere on the surface of the sphere. By joining each vertex (corner) of 
the triangle to the centre of the sphere, we obtain three orthogonal 
reference axes. 
These axes are such that any test on side a of the triangle has a zero 
correlation with the reference axis which joins the origin to the opposite 
vertex A. Each test lies in the plane of two reference axes and, if these 
axes are taken to represent factors, may be explained by not more than 
two factors. 

The concept of simple structure, considered geometrically, is clear and 
unambiguous. Problems arise only when the pattern which it imposes 
has to be fitted arithmetically to untidy real data. The geometrical pic- 
ture of orthogonal simple structure at once suggests the possibility, and 
indeed the probability, of non-orthogonal (oblique) simple structure. 
Suppose that a triangle exists, in the sense that all the tests lie on the 
sides of a triangular configuration, but the length of any or all of the 
sides is not equal to 90° on the spherical surface. In this situation 
Thurstone identifies his ‘primary vectors which represent the three 
underlying factors or parameters in pure form’ as the vectors joining 
the centre of the sphere to the vertices of the triangle. In a typical 
analysis these primary vectors are positively correlated with one 
another, because the triangle which defines them has sides which are 
smaller than 90°. Thus a test which lies on side a will be positively 
correlated with primary factor A. Factor A is not, however, necessary to 
the description or specification of the test, because the latter lies in a 
plane defined by factors B and C. In order to eliminate A from the 
specification of the test, a reference axis A’ is supplied. A’ is defined as 
the diameter of the sphere which is at right-angles to the plane defined 
by factors B and C. Any test lying in this plane necessarily has a zero 
weight or saturation in A’. Two other reference vectors, similarly 
defined, are associated with the other two primary factors. The reference 
axes A’, B’ and C’ are the ‘rotated factors’ which are the usual 
conclusion of a rotation to oblique simple structure. If the primary factors 
are orthogonal to one another, each primary factor coincides with its 
associated reference axis. 
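
The geometry of primary factors and reference axes may be easier to follow in a small numerical sketch. The primary vectors below are hypothetical (three unit vectors set 60° apart, so that each pair correlates 0.5); the helper names are assumptions made for the illustration and are not Thurstone's notation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    length = math.sqrt(dot(u, u))
    return tuple(a / length for a in u)

def reference_axis(primary, other_1, other_2):
    """Unit normal to the plane of the two other primaries, on the side of `primary`."""
    normal = unit(cross(other_1, other_2))
    return normal if dot(normal, primary) >= 0 else tuple(-a for a in normal)

# Hypothetical primary vectors, 60 degrees apart (pairwise correlation 0.5).
A = (1.0, 0.0, 0.0)
B = (0.5, math.sqrt(0.75), 0.0)
C = (0.5, math.sqrt(0.75) / 3.0, math.sqrt(2.0 / 3.0))

A_ref = reference_axis(A, B, C)
# A test on side a of the triangle lies in the plane of B and C.
test_on_side_a = unit(tuple(b + c for b, c in zip(B, C)))

correlation_with_A = dot(test_on_side_a, A)        # positive
saturation_in_A_ref = dot(test_on_side_a, A_ref)   # zero, up to rounding

# With orthogonal primaries the reference axis coincides with the primary factor.
orthogonal_ref = reference_axis((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
print(round(correlation_with_A, 3), round(saturation_in_A_ref, 3), orthogonal_ref)
```

The test correlates positively with primary factor A, yet has a zero saturation in the reference axis A’, which is the distinction drawn in the text.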

After illustrating the nature of simple structure by an analysis of 
measurements of boxes, Thurstone says, ‘There is no guarantee 
beforehand that a simple configuration like figure 11 can be found in any 
given problem, and there is no guarantee that a clarifying description of 
the factors can be obtained.’ Implicit in this statement is a double 
warning, first, that no simple configuration need occur, and second, that 
any simple configuration which does occur need not conform to the 
pattern of simple structure. It requires little imagination or artistic 
ability to draw neat shapes on the surface of a sphere. The variety of 
possible figures is considerable, and any closed figure may be conceived 
as a set of lines on which test points lie, or as a boundary enclosing test 
points. Simple structure is one possible configuration. Even when simple 
structure occurs, it is an open question whether the vertices of the 
triangle should be given a privileged position in the interpretation of the 
analysis. 


Chapter 6 Multiple Regression Methods 


This chapter and the two following deal with the techniques commonly 
known as multiple regression analysis, canonical correlation analysis, 
and discriminant function analysis. All three methods are instances of 
regression on a set of two or more variables. In multiple regression 
analysis, we calculate the regression of a criterion or predicted variable 
upon a weighted combination of predictor variables. In canonical cor- 
relation analysis, we calculate the regression of a weighted combination 
of criterion or predicted variables upon a weighted combination of 
predictor variables. In discriminant function analysis, we treat the 
problem of classification as a problem of regression, in that we calculate 
the regression of the differences among the multivariate means of groups 
upon a weighted combination of predictor variables. Although all three 
methods are multiple regression methods, we shall reserve the name 
‘multiple regression’ for the first of the three alone. 

All three techniques seek to maximize the correlation coefficient 
associated with the particular regression which is under investigation. 
Although it is customary to call the two kinds of variables which enter 
into the analyses predictor and predicted variables, the logic of the 
techniques is, of course, independent of temporal order. The distinction 
between the logic of regression analysis and the logic of the language 
of prediction is close to Kant’s distinction between the unschematized 
category of ground and consequent and his schematized category of 
cause and effect. We find it convenient to think of cause and effect, or 
predictor and predicted, in terms of temporal succession in a given 
order, but there is no necessity for the predicted variable to come after 
the predictor variable in time. 


Multiple regression analysis 


A psychotherapist wishes to determine the type of patient who responds 
best to his treatment, so that in the future he can select such persons for 
treatment. He asks a psychologist to administer an intelligence test and 
a personality test to unselected patients before he begins treatment, and 
at the end of treatment another therapist rates the success of treatment 
on a seven-point scale. We may consider the ratings as scores on a 
normally distributed variable. The scores are as follows: 

Patient     IQ    Personality    Success 
A           91         4            7 
B          100         2            4 
C          109         3            1 
D           97         0            3 
E          103         6            5 


We express these scores as deviation scores and calculate the sums of 
squares and sums of products matrix. We draw lines to separate the 
predictor variables from the predicted variable. 

                 IQ    Personality | Success 
IQ            | 180         9      |  -48 
Personality   |   9        20      |    9 
              |--------------------+-------- 
Success       | -48         9      |   20 


We now convert the sums of products to correlations. We assume 
that the correlation of IQ with itself is perfect and we make the same 
assumption for Personality. We omit the correlation of Success with 
itself. We thus obtain a predictor correlation matrix R and a vector k 
which contains the correlations between the predictor variables and the 
predicted or criterion variable. 

                     R              k 
IQ             1.00    0.15       -0.80 
Personality    0.15    1.00        0.45 

We solve $b = R^{-1}k$: 

                   R^{-1}            k         b 
IQ             1.023   -0.153      -0.80     -0.89 
Personality   -0.153    1.023       0.45      0.58 


The elements of b are the regression weights which we apply to a 
person’s standard scores in assessing the probable outcome of his 
treatment. 

To calculate a person’s standard score z on a particular variable, we 
require the mean and standard deviation of that variable. The standard 
deviation is obtained as the square root of the sum of squared deviations 
divided by the number of persons from which this sum is derived: 

$$ SD = \sqrt{\frac{\sum x^2}{n}} $$

Thus in the case of IQ: 

$$ SD = \sqrt{\frac{180}{5}} = 6 $$

And the standard score of patient A is 

$$ z = \frac{91 - 100}{6} = -1.5 $$

So the ‘score’ of this patient on the new vector which approximates to 
rated success of treatment is 

$$ (-1.5)(-0.89) + (0.5)(0.58) = 1.625 $$
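
The whole chain of calculation, from raw scores to patient A's predicted score, can be retraced in a short sketch. The scores used are those which agree with the printed sums of squares and products (IQ 91, 100, 109, 97, 103; Personality 4, 2, 3, 0, 6; Success 7, 4, 1, 3, 5); the function and variable names are assumptions made for the illustration. Because the weights are carried at full precision instead of being rounded to two decimals, the last figure differs slightly from 1.625.

```python
import math

iq = [91, 100, 109, 97, 103]
personality = [4, 2, 3, 0, 6]
success = [7, 4, 1, 3, 5]

def deviations(xs):
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

def sum_products(xs, ys):
    """Sum of products of deviation scores (a sum of squares when xs is ys)."""
    return sum(x * y for x, y in zip(deviations(xs), deviations(ys)))

def correlation(xs, ys):
    return sum_products(xs, ys) / math.sqrt(sum_products(xs, xs) * sum_products(ys, ys))

def standard_scores(xs):
    # Divisor n, as in the text.
    sd = math.sqrt(sum_products(xs, xs) / len(xs))
    return [d / sd for d in deviations(xs)]

r12 = correlation(iq, personality)                                  # off-diagonal of R
k = [correlation(iq, success), correlation(personality, success)]   # vector k

# Inverse of the 2 x 2 matrix [[1, r12], [r12, 1]], then b = R^-1 k.
det = 1.0 - r12 * r12
R_inv = [[1.0 / det, -r12 / det], [-r12 / det, 1.0 / det]]
b = [R_inv[0][0] * k[0] + R_inv[0][1] * k[1],
     R_inv[1][0] * k[0] + R_inv[1][1] * k[1]]

z_iq = standard_scores(iq)
z_personality = standard_scores(personality)
predicted_a = b[0] * z_iq[0] + b[1] * z_personality[0]
print([round(w, 3) for w in b], round(predicted_a, 3))
```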


Performing these calculations for each patient gives us an estimate or 
prediction of the success of the psychotherapist’s treatment. To 
compare a predicted rating of success with an actual rating, we must 
standardize the actual ratings. The mean of the ratings of success is 4 and 
the standard deviation is 2. The predicted and actual ratings of each 
patient are 

           Success of treatment 
Patient    Predicted    Actual 
A            1.625       1.500 
B           -0.290       0 
C           -1.335      -1.500 
D           -0.425      -0.500 
E            0.425       0.500 


The estimates are fairly close to the actual values and the correlation 
between the two is R = 0.99. This is the maximum possible correlation 
between the criterion variable and any linear combination of the two 
predictor variables. In practice, this multiple correlation coefficient R 
(which should not be confused with the correlation matrix R) is 
calculated by finding the inner product of vector k and vector b and taking 
the square root of this product. 

    k         b       Product 
  -0.80     -0.89      0.712 
   0.45      0.58      0.261 
               R^2 =   0.973 
               R   =   0.99 

The values of 0.712 and 0.261 are known as coefficients of separate 
determination. It is sometimes useful to express the contribution of a 
variable to the prediction by expressing its coefficient of separate 
determination as a proportion or percentage of R^2. Thus the contribution of 
tested intelligence is 

$$ \frac{0.712}{0.973} = 0.73 $$
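
As a check on the arithmetic, the coefficients of separate determination and the multiple correlation can be recomputed from the correlations alone. The sketch below assumes the values r = 0.15 between the two predictors and k = (-0.80, 0.45); the variable names are illustrative. The weights are not rounded, so the products differ in the third decimal place from 0.712 and 0.261.

```python
import math

r12 = 0.15             # correlation between IQ and Personality
k = [-0.80, 0.45]      # correlations of the predictors with Success

# b = R^-1 k for the 2 x 2 predictor correlation matrix.
det = 1.0 - r12 * r12
b = [(k[0] - r12 * k[1]) / det, (k[1] - r12 * k[0]) / det]

separate = [ki * bi for ki, bi in zip(k, b)]   # coefficients of separate determination
r_squared = sum(separate)                      # inner product of k and b
multiple_r = math.sqrt(r_squared)
shares = [s / r_squared for s in separate]     # proportion of R^2 due to each variable

print([round(s, 3) for s in separate], round(r_squared, 3), round(multiple_r, 2))
print([round(s, 2) for s in shares])
```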


It is unfortunate that this method of assessing the relative contributions 
of the variables can lead to apparently absurd results when any of the 
coefficients of separate determination is a negative value. The use of 
coefficients of separate determination in the assessment of the relative 
importance of variables in a regression equation is discussed in chapter 
[...]. 

Considerable care must be taken in arguing from a particular multiple 
regression analysis to conclusions about the relative importance of actual 
characteristics in the prediction or determination of a particular 
criterion. In the first place, the technique of multiple regression maximizes 
the value of R in the particular set of data to which it is applied. In 
applying the obtained regression weights to a fresh set of data, the value 
of the correlation between predicted and actual criterion scores will 
certainly drop, and it may drop considerably. We express this fact by 
saying that the technique capitalizes upon chance variations. The square 
of the multiple correlation coefficient resembles a latent root in that it is 
a maximized rather than an estimated value. In the second place, the 
technique deals only with characteristics as measured and not with 
characteristics as such. If a measuring instrument is faulty, its unrelia- 
bility will automatically reduce the apparent weight of the characteristic 
which it purports to measure in the prediction of the criterion. In the 
third place, it is possible to disguise the importance of a characteristic 
by introducing several measuring instruments, all of which measure the 
characteristic to some extent. These variables, which will be correlated 
with one another, are each given a weight in the prediction of the 
criterion, and thus the true weight of the common characteristic is 
dispersed among several variables. These caveats apply to all multivariate 
regression methods. 

The choice of variables to enter into a multiple regression analysis is 
not simply a matter of selecting those which have the highest (positive 
or negative) correlations with the criterion. It is quite possible for a 
variable to have a zero correlation with the criterion and yet for it to 
contribute indirectly by ‘suppressing’ that part of another variable 
which reduces the latter’s correlation with the criterion. Consider the 
following example: 

          R             k 
    1.0     0.6        0.8 
    0.6     1.0        0 

          R^{-1}              k        b       Product 
    1.5625    -0.9375        0.8      1.25      1.00 
   -0.9375     1.5625        0       -0.75      0 
                                        R^2 =   1.00 
                                        R   =   1.00 

Although the percentage contribution of the second variable to R^2 
appears to be zero, nevertheless the second variable correlates perfectly 
with that part of the first variable which does not correlate with the 
criterion. Thus, by subtracting a person’s weighted standard score on 
the second variable, we are left with that part of the first variable which 
correlates perfectly with the criterion. 
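
The suppressor effect can be verified directly. The sketch assumes the correlations of the example (0.6 between the two predictors, 0.8 and 0 with the criterion) and compares the squared multiple correlation with the squared correlation of the first variable taken alone; the variable names are illustrative.

```python
r12 = 0.6           # correlation between the two predictors
k = [0.8, 0.0]      # the second predictor is uncorrelated with the criterion

det = 1.0 - r12 * r12
R_inv = [[1.0 / det, -r12 / det], [-r12 / det, 1.0 / det]]
b = [R_inv[0][0] * k[0] + R_inv[0][1] * k[1],
     R_inv[1][0] * k[0] + R_inv[1][1] * k[1]]

products = [ki * bi for ki, bi in zip(k, b)]
r_squared_both = sum(products)      # with the suppressor variable included
r_squared_first_alone = k[0] ** 2   # first predictor on its own

print([round(w, 4) for w in b], round(r_squared_both, 4), round(r_squared_first_alone, 4))
```

The second variable receives a weight of -0.75 although its coefficient of separate determination is zero, and its inclusion raises the squared correlation from 0.64 to unity.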

To take a hypothetical example: we might regard strength of super- 
ego as a suppressor variable. Although it is not (we may suppose) 
correlated with the likelihood of a person breaking out into actual 
aggressive action (criterion variable), nevertheless it is correlated with 
strength of aggressive drive, and aggressive drive is correlated with 
aggressive action. By taking into account the strength of a person’s 
super-ego as well as the strength of his aggressive drive, we can get a 
much higher correlation between action and diathesis than we should 
by looking at either variable alone. 

The significance of a multiple correlation may be tested by calculating 
a chi square from an L-criterion or by calculating a variance ratio F. 
Reverting to our first example, we calculate 

$$ L = \frac{1 - R^2}{1} = \frac{1 - 0.973}{1} = 0.027 $$

The divisor 1 is redundant. It has been inserted to show that L is the 
ratio of non-predicted variance to total variance on the criterion 
variable. Another way of calculating L is to take, for each person, the 
difference between his actual and his predicted criterion score, and to 
divide the sum of squares of these differences by the sum of squares of 
the actual scores. Because the actual scores have been standardized, 
their sum of squares is equal to the number of persons, which, in the 
present example, is five. It may be appreciated, therefore, that, just as 
in the analysis of dispersion, L is a ratio of a within or error sum of 
squares to a total sum of squares. The multiplier for $\log_e L$ is either the 
number of persons n or, more accurately, $n - 1 - \tfrac{1}{2}(t + 2)$, where t is 
the number of predictor variables: 

$$ \chi^2 = -[5 - 1 - \tfrac{1}{2}(2 + 2)] \log_e L = 7.18 $$

The degrees of freedom are t = 2. 

The variance ratio F is the ratio of predicted to non-predicted 
variance. The predicted variance has degrees of freedom t = 2 and the 
non-predicted variance has degrees of freedom n − t − 1 = 2. The 
variance ratio is 

$$ F = \frac{R^2}{t} \bigg/ \frac{1 - R^2}{n - t - 1} = \frac{0.973}{2} \bigg/ \frac{0.027}{2} = 35.20 $$
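
Both tests can be reproduced from the correlations of the first example. The sketch below assumes n = 5 persons, t = 2 predictors, a correlation of 0.15 between the predictors and k = (-0.80, 0.45); it carries R^2 at full precision, which is why it arrives at 7.18 and 35.20 and not at the values which the rounded figures 0.973 and 0.027 would give.

```python
import math

n, t = 5, 2
r12 = 0.15
k = [-0.80, 0.45]

# R^2 = k' R^-1 k for a 2 x 2 predictor correlation matrix.
r_squared = (k[0] ** 2 + k[1] ** 2 - 2.0 * r12 * k[0] * k[1]) / (1.0 - r12 ** 2)

L = 1.0 - r_squared                      # non-predicted share of criterion variance
multiplier = n - 1 - 0.5 * (t + 2)       # the more accurate multiplier for log_e L
chi_square = -multiplier * math.log(L)   # degrees of freedom: t

f_ratio = (r_squared / t) / ((1.0 - r_squared) / (n - t - 1))   # d.f.: t and n - t - 1

print(round(chi_square, 2), round(f_ratio, 2))
```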


Two things must be said about these tests of significance. The first is 
that the values of chi square and F differ slightly from those which the 
figures preceding them would suggest. This is merely a consequence of 
rounding errors in the preceding figures: the errors have been removed 
in the calculation of the final values. The second point to be made is 
that, where a variance ratio is available, it is to be preferred to a chi 
square approximation. The latter has been introduced here partly to 
illuminate the relation between multivariate analysis of variance and 
multiple regression methods, and partly because chi square is the only 
test available in more complicated regression methods. 


Canonical correlation analysis 


A number of men have been tested on two variables, aggression and 
extraversion, and their wives have also been tested on two variables, 
intropunitiveness and passivity. We wish to establish the nature of the 
association between the personality of a husband and the personality of 
his wife. It will be convenient to regard the husband variables as pre- 
dictor variables and the wife variables as criterion variables, though 
the technique of canonical correlation makes no distinction between 
the two sets of variables; we could equally well think of a wife’s per- 
sonality influencing her husband. It is not necessary for the number 
of predictor variables to be equal to the number of criterion 
variables. 

The scores of five pairs of spouses are as follows (each score is 
expressed as a deviation from its test mean): 

                   Husband                        Wife 
Marriage   Aggression  Extraversion   Intropunitiveness  Passivity 
m1            -2           -1                -1             -2 
m2            -1           -2                -2             -1 
m3             0            2                 0              2 
m4             1            1                 2              0 
m5             2            0                 1              1 


The correlations between the variables are as follows (each variable 
is assumed to correlate perfectly with itself): 

                         Husband                       Wife 
               Aggression  Extraversion | Intropunitiveness  Passivity 
Husband  A        1.0          0.5      |       0.8             0.7 
         E        0.5          1.0      |       0.7             0.8 
         -------------------------------+------------------------------ 
Wife     I        0.8          0.7      |       1.0             0.5 
         P        0.7          0.8      |       0.5             1.0 

The four submatrices of this correlation matrix may be symbolized by 

$$ \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix} $$

The present example is particularly simple. In the general case no 
submatrix will be the same as any other submatrix, except that $R_{21}$ will 
always be the transpose of $R_{12}$. Submatrices $R_{12}$ and $R_{21}$ will not, in 
general, be symmetric; there is no reason to suppose that husbands’ 
aggression and wives’ passivity will correlate to exactly the same degree 
as husbands’ extraversion and wives’ intropunitiveness. 

We now calculate the latent roots ($\mu_i$) and latent vectors ($b_i$) of the 
following equation: 

$$ (R_{21} R_{11}^{-1} R_{12} - \mu_i R_{22})\, b_i = 0 $$

where the values of $\mu_i$ are such that 

$$ |R_{21} R_{11}^{-1} R_{12} - \mu R_{22}| = 0 $$

The equation may also be written 

$$ (R_{22}^{-1} R_{21} R_{11}^{-1} R_{12} - \mu_i I)\, b_i = 0 $$

Since $R_{11}^{-1}$ is 

     1.333    -0.667 
    -0.667     1.333 

and $R_{22}^{-1}$ is (in this example) identical with $R_{11}^{-1}$ then the matrix 
$R_{22}^{-1} R_{21} R_{11}^{-1} R_{12}$ is 

    0.52    0.48 
    0.48    0.52 

This matrix is again exceptional: in the general case it will not be 
symmetric. Its latent roots and vectors are 

          b1        b2 
         √0.5     -√0.5 
         √0.5      √0.5 

   μ     1.00      0.04 

We now calculate a vector $a_i$ from each $b_i$ by the formula 

$$ a_i = R_{11}^{-1} R_{12} b_i $$

and, with each $a_i$ rescaled to unit length, we obtain 

          a1        a2 
         √0.5     -√0.5 
         √0.5      √0.5 

The identity between $a_i$ and its corresponding $b_i$ is exceptional and is 
due to the simplicity of the example. 

The results of our calculations can be set out as follows: 

                                  a1        a2 
Husband   Aggression             √0.5     -√0.5 
          Extraversion           √0.5      √0.5 

                                  b1        b2 
Wife      Intropunitiveness      √0.5     -√0.5 
          Passivity              √0.5      √0.5 

          μ                      1.00      0.04 
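
The matrix algebra of this example is small enough to be checked with a few helper functions for 2 x 2 matrices. The sketch assumes the four submatrices of the husband and wife correlation matrix given above; the helper names, the rescaling of each vector to unit length, and the choice of sign (last element positive) are conventions adopted for the illustration.

```python
import math

def matmul(a, b):
    """Product of two 2 x 2 matrices held as nested lists."""
    return [[sum(a[i][m] * b[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def unit(v):
    """Rescale to unit length, with the sign chosen so the last element is positive."""
    length = math.hypot(v[0], v[1])
    v = [x / length for x in v]
    return v if v[1] >= 0 else [-x for x in v]

R11 = [[1.0, 0.5], [0.5, 1.0]]   # husband variables
R12 = [[0.8, 0.7], [0.7, 0.8]]   # husband rows, wife columns
R21 = [[R12[0][0], R12[1][0]], [R12[0][1], R12[1][1]]]   # transpose of R12
R22 = [[1.0, 0.5], [0.5, 1.0]]   # wife variables

M = matmul(matmul(inverse(R22), R21), matmul(inverse(R11), R12))

# Latent roots of a 2 x 2 matrix: mu^2 - trace * mu + det = 0.
trace = M[0][0] + M[1][1]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
spread = math.sqrt(trace * trace - 4.0 * det)
roots = [(trace + spread) / 2.0, (trace - spread) / 2.0]

# (M - mu I) b = 0 is satisfied by b proportional to (M[0][1], mu - M[0][0]),
# provided M[0][1] is not zero, as is the case here.
b_vectors = [unit([M[0][1], mu - M[0][0]]) for mu in roots]
a_vectors = [unit(matvec(matmul(inverse(R11), R12), b)) for b in b_vectors]

print([round(mu, 2) for mu in roots])
print([[round(x, 4) for x in v] for v in b_vectors])
print([[round(x, 4) for x in v] for v in a_vectors])
```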


The best way to grasp the meaning of these weights is to calculate the 
standard score of each person on each variable and apply the weights 
to the standard scores. The standard scores are 

                   Husband                        Wife 
Marriage   Aggression  Extraversion   Intropunitiveness  Passivity 
m1           -√2.0        -√0.5             -√0.5          -√2.0 
m2           -√0.5        -√2.0             -√2.0          -√0.5 
m3            0            √2.0              0              √2.0 
m4            √0.5         √0.5              √2.0           0 
m5            √2.0         0                 √0.5           √0.5 

We apply the regression weights a1 to the scores for each husband and 
the regression weights b1 to the scores for each wife. For example, in the 
first marriage, we calculate 

$$ (-\sqrt{2.0})(\sqrt{0.5}) + (-\sqrt{0.5})(\sqrt{0.5}) = -1.5 $$

for the husband and 

$$ (-\sqrt{0.5})(\sqrt{0.5}) + (-\sqrt{2.0})(\sqrt{0.5}) = -1.5 $$

for the wife. For all five marriages the pairs of scores on the first 
canonical variate (as it is called) are 

Marriage    Husband    Wife 
m1           -1.5      -1.5 
m2           -1.5      -1.5 
m3            1.0       1.0 
m4            1.0       1.0 
m5            1.0       1.0 


The correlation between the two sets of scores is equal to the square 
root of the first latent root, that is, it is equal to unity. The latent (or 
canonical) root in canonical correlation analysis corresponds to the 
square of the multiple correlation coefficient in multiple regression 
analysis. With one criterion variable we obtain one multiple correlation; 
with q criterion variables we obtain q canonical correlations. Strictly 
speaking, the number of canonical roots is equal to p or q, whichever is 
smaller, where p is the number of predictor variables, and q is the number 
of criterion variables. The mathematics of canonical correlation does 
not give logical priority to either set of variables. 

A set of regression weights is known as a canonical vector, and the 
set of derived scores which results when the weights are applied to 
standard scores is known as a canonical variate. The first canonical 
variate in the present example shows a perfect correspondence between 
the personalities of husband and wife when each characteristic is 
weighted equally within each of the two sets. A husband who is low on 
aggression and extraversion has a wife who is low on intropunitiveness 
and passivity, and so on. 

Now we may apply the regression weights a2 and b2 to the standard 
scores to obtain the second canonical variate: 

Marriage    Husband    Wife 
m1            0.5      -0.5 
m2           -0.5       0.5 
m3            1.0       1.0 
m4            0        -1.0 
m5           -1.0       0 

The correlation between the canonical variate scores of husbands and 
wives is 0.20, which is equal to the square root of the second latent root. 
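
The scores on both canonical variates, and the correlations among them, can be recomputed from the deviation scores of the five marriages. The sketch takes the first wife's intropunitiveness as -1, the value implied by her standard score, and uses the unit-length weights of the example; the function names are assumptions made for the illustration.

```python
import math

# Deviation scores: (aggression, extraversion) and (intropunitiveness, passivity).
husbands = [(-2, -1), (-1, -2), (0, 2), (1, 1), (2, 0)]
wives = [(-1, -2), (-2, -1), (0, 2), (2, 0), (1, 1)]

def standardize(column):
    """Standard scores for a column of deviation scores (divisor n, as in the text)."""
    sd = math.sqrt(sum(x * x for x in column) / len(column))
    return [x / sd for x in column]

def variate(rows, weights):
    """Apply canonical weights to the standard scores of each person."""
    columns = [standardize(col) for col in zip(*rows)]
    return [sum(w * col[i] for w, col in zip(weights, columns))
            for i in range(len(rows))]

def correlation(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

h = math.sqrt(0.5)
first, second = (h, h), (-h, h)   # weights a1 = b1 and a2 = b2 in this example

h1, w1 = variate(husbands, first), variate(wives, first)
h2, w2 = variate(husbands, second), variate(wives, second)

print([round(x, 2) for x in h1], [round(x, 2) for x in w1])
print([round(x, 2) for x in h2], [round(x, 2) for x in w2])
print(round(correlation(h1, w1), 2), round(correlation(h2, w2), 2))
```

The remaining correlations, such as that between husbands on the first variate and wives on the second, may be obtained with the same `correlation` function; in this example they are all zero.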

We may now examine these two canonical variates in order to appre- 
ciate their defining properties. If the husbands’ scores on the first variate 
are correlated with their scores on the second variate, it will be seen that 
the correlation is zero. The same is true of the wives’ scores. Canonical 
variates have the property of being orthogonal to one another within 
the set of predictor variables and within the set of criterion variables. 
Furthermore, each predictor variate is orthogonal to all but one of the 
criterion variates. Thus the correlation between husbands on variate one 
and wives on variate two is zero, and so is the correlation between wives 
on variate one and husbands on variate two. The canonical correlations 
are usually reported in descending order of size. The first canonical 
correlation therefore represents the maximum possible correlation be- 
tween any weighted linear combination of predictor variables and any 
weighted linear combination of criterion variables. In geometrical terms, 
it represents the cosine of the smallest possible angle between any vector 
in the predictor space and any vector in the criterion space. 

In a typical analysis the predictor variables may be observations of 
the early environment of a group of persons, while the criterion vari- 
ables are measures of their later performance. Or the predictor variables 
may be measures of the personality and aptitude of students about to 
take a course, while the criterion variables are measures of their success 
in various aspects of their subsequent career. Or again, one might ask 
two judges to rate the same set of persons on any constructs which the 
judges care to name (see chapter 5). It is quite possible that the distri- 
bution of points (rated persons) in the space of the first judge will be 
similar to the distribution of points in the space of the second judge, 
even though what one judge calls forcefulness is called dominance by 
the other, and vice versa. 

The test of significance for a multiple regression analysis with only 
one criterion variable involves the calculation of $L = 1 - R^2$. In the 
case of a canonical correlation analysis we calculate 

$$ L = (1 - \mu_1)(1 - \mu_2)(1 - \mu_3) \cdots (1 - \mu_q) $$

where q is the number of canonical roots. In our example there are two, 
and so 

$$ L = (1 - 1.00)(1 - 0.04) = 0 $$


= 2, the number in the 


The number of variables in the first set is p 
5. We make an 


second set is q = 2, and the number of persons is 7 = 
overall test of all the canonical roots by calculating 


x = —[n — 1 — {p +g + Vlog L 
[5-1-4342 +2 + 1)]log.0 


= —1-5 log,0 
= infinity 
with degrees of freedom 
pg=2x2=4 


When one or more of the canonical correlations is perfect, chi square
becomes infinite because no measure of deviation from hypothesis is
available. It is still possible, however, to break down $\chi^2$ into
tests of the individual canonical roots. $\chi_1^2$ may be regarded as a
test of the first and all subsequent roots. We now calculate $\chi_2^2$
for the second and all subsequent roots. For this purpose we require

$$L = (1 - \rho_2)(1 - \rho_3) \cdots (1 - \rho_q)$$

In our example this product has only the one term, L = (1 − 0.04)
= 0.96. The multiplier for $\log_e L$ remains unchanged, and so we
calculate

$$\chi_2^2 = -1.5 \log_e 0.96 = 0.06$$

The degrees of freedom are (p − k)(q − k), where k = 1 is the number
of roots which have been omitted from the calculation of L.

$$\text{d.f.} = (2 - 1)(2 - 1) = 1$$

The tests of significance may be set out as follows, the chi square for
the first root being the difference between $\chi_1^2$ and $\chi_2^2$.

Test of                         Chi square    d.f.
All roots                       Infinity      4
All roots except the first      0.06          1
First root                      Infinity      3

If, as in this example, there are only two roots, the second chi square is
a test of the second root. If there are more than two roots, we may
calculate for the third and all subsequent roots

$$L = (1 - \rho_3)(1 - \rho_4) \cdots (1 - \rho_q)$$

which yields a chi square with (p − 2)(q − 2) degrees of freedom. By
subtracting this third chi square from the second, we obtain a test of
the second root. The sum of the chi squares for the individual roots is
equal to the first chi square.

Lawley (1959) has shown how the multipliers for L may be modified
in order to improve the chi square approximations.
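
The successive tests can be reproduced in a few lines. What follows is
a minimal sketch rather than a definitive routine: the function name
`canonical_root_tests` and its arguments are illustrative assumptions,
and the multiplier is the unmodified one used in the worked example
(Lawley's refinement is not applied).

```python
import math


def canonical_root_tests(roots, n, p, q):
    """Chi-square tests of the roots remaining after the first k are omitted.

    roots: canonical roots (squared canonical correlations), largest first.
    Returns a list of (k, chi_square, degrees_of_freedom) tuples.
    """
    multiplier = n - 1 - 0.5 * (p + q + 1)
    results = []
    for k in range(len(roots)):
        L = math.prod(1.0 - r for r in roots[k:])
        # A perfect root makes L zero, so chi square is reported as infinite.
        chi_square = math.inf if L <= 0.0 else -multiplier * math.log(L)
        results.append((k, chi_square, (p - k) * (q - k)))
    return results


# Worked example from the text: roots 1.00 and 0.04, five persons, p = q = 2.
tests = canonical_root_tests([1.00, 0.04], n=5, p=2, q=2)
for k, chi_square, df in tests:
    print(f"omitting the first {k} roots: chi square = {chi_square:.2f}, d.f. = {df}")
```

The chi square for an individual root is then the difference between
two successive entries, with the corresponding difference in degrees
of freedom.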


Interpretation of a regression analysis 


The reader who wishes to see some pattern and meaning in the
techniques of multiple regression analysis may care to consider the basic
formulae for the relations between two sets of variables. These may be
set out in terms of the partitioned matrix:

$$\begin{pmatrix} R_{xx} & R_{xy} \\ R_{yx} & R_{yy} \end{pmatrix}$$

This is the same matrix as before, except that letters have replaced
numbers in the subscripts. The elements of the submatrices are
correlation coefficients if the tests have been normalized, and variances
and covariances if they have not.

The basic formula states how much of $R_{yy}$ (or rather, of the score
matrix from which it is derived) can be predicted from, or accounted
for, by the x variables:

$$R_{yx} R_{xx}^{-1} R_{xy}$$

The deviation from regression, that is, the part of $R_{yy}$ which
cannot be accounted for by the x variables, is

$$R_{yy} - R_{yx} R_{xx}^{-1} R_{xy}$$

The method of multiple regression analysis is a particular case of
canonical correlation analysis, and we may illustrate these formulae by
applying them to the example of multiple regression which was used to
explain the nature of a suppressor variable. The partitioned correlation
matrix for this analysis is

$$\left(\begin{array}{cc|c} 1.0 & 0.6 & 0.8 \\ 0.6 & 1.0 & 0 \\ \hline 0.8 & 0 & 1.0 \end{array}\right)$$

The reciprocal of $R_{xx}$ appears as $R^{-1}$ above. The regression
formula $R_{yx} R_{xx}^{-1} R_{xy}$ is therefore

$$\begin{pmatrix} 0.8 & 0 \end{pmatrix}
\begin{pmatrix} 1.5625 & -0.9375 \\ -0.9375 & 1.5625 \end{pmatrix}
\begin{pmatrix} 0.8 \\ 0 \end{pmatrix} = 1.0$$

This is the value of the square of the multiple correlation coefficient,
i.e. it is the proportion of the variance of the criterion (y) which can
be accounted for by the predictor variables (x). The deviation from
regression is

$$1.0 - 1.0 = 0$$

The $R^2$ of a multiple regression analysis is identical with the first,
and only, latent root of a canonical correlation analysis in which one
set of variables has only one member.

In some analyses it may be that the two sets of variables of a canonical
correlation analysis have equal logical status, and this applies also to
the special case of multiple regression analysis. We may, therefore,
turn the regression formula round and calculate the regression of the
x variables on the y variable:

$$R_{xy} R_{yy}^{-1} R_{yx}$$

The deviation from regression is

$$R_{xx} - R_{xy} R_{yy}^{-1} R_{yx}$$

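Since $R_{yy}$ has only one element, and this has a value of unity, the
reciprocal of $R_{yy}$ is also unity. The first equation is therefore

$$\begin{pmatrix} 0.8 \\ 0 \end{pmatrix}
\begin{pmatrix} 1 \end{pmatrix}
\begin{pmatrix} 0.8 & 0 \end{pmatrix}
= \begin{pmatrix} 0.64 & 0 \\ 0 & 0 \end{pmatrix}$$

and the deviation from regression is

$$\begin{pmatrix} 1.0 & 0.6 \\ 0.6 & 1.0 \end{pmatrix}
- \begin{pmatrix} 0.64 & 0 \\ 0 & 0 \end{pmatrix}
= \begin{pmatrix} 0.36 & 0.60 \\ 0.60 & 1.00 \end{pmatrix}$$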
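
Both directions of regression can be checked numerically. The sketch
below assumes NumPy is available; the variable names are illustrative
and simply mirror the submatrices of the partitioned correlation
matrix of the suppressor-variable example.

```python
import numpy as np

# Submatrices of the partitioned correlation matrix (two predictors, one criterion).
R_xx = np.array([[1.0, 0.6], [0.6, 1.0]])
R_xy = np.array([[0.8], [0.0]])
R_yx = R_xy.T
R_yy = np.array([[1.0]])

# Regression of y on x: the part of R_yy accounted for by the x variables.
explained_y = R_yx @ np.linalg.inv(R_xx) @ R_xy
residual_y = R_yy - explained_y

# Regression of x on y: the part of R_xx accounted for by the y variable.
explained_x = R_xy @ np.linalg.inv(R_yy) @ R_yx
residual_x = R_xx - explained_x

print("R squared:", explained_y.item())
print("deviation of y from regression:", residual_y.item())
print("part of R_xx accounted for by y:\n", explained_x)
print("deviation of x from regression:\n", residual_x)
```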


A consideration of the preceding formulae is important for the inter- 
pretation of any multiple correlation or canonical correlation analysis. 
A canonical variate resembles a principal component in that it is a
dimension which has been chosen so as to maximize a certain value. 
The logic of maximization is that it facilitates prediction. The research 
worker, however, is interested not only in the success of a prediction 
but also in the reasons why it is successful. He therefore wishes to stand 
the logic of regression on its head and to interpret the meaning of his 
predictors. He attempts to do this by looking at the relative weights of 
the tests in the weighting vectors used to calculate a canonical variate, 
just as he interprets a component by looking at the relative weights of 
the tests in its weighting vector. This is often an adequate procedure, 
but the interpretation it suggests should be checked by looking at the 
relative positions of known persons on the canonical variate or the 
principal component. Sometimes the procedure is not adequate because 
a canonical variate is an untidy compound of diverse and uninterpret- 
able variables. Such variates are not uncommon in the social sciences, 
partly because choice of variables is not dictated by established theory, 
and partly because tests have a low validity. Methods of correlation, 
regression or covariance are dangerous if one cannot know whether the 
correlation between a test and some other measure is due to the valid 
or the invalid portion of the test’s variance. 

The occurrence of a single non-zero canonical correlation or multiple 
correlation indicates that, by an appropriate weighting of the x and the 
y variables, one set of variables can be projected into the space of the 
other along a straight line. The existence of two non-zero canonical 
correlations indicates that each set projects into a two-dimensional plane 
of the other set. If two non-zero canonical correlations are of equal 
size, then their associated canonical vectors are not uniquely determined. 
In consequence, the two canonical variates associated with the equal
canonical roots may take up any positions in their planes, always
providing that they remain at right-angles to one another. Thus no
confidence can be placed in the weighting of tests in two weighting vectors
whose roots do not differ in size. This holds true, even if both the 
canonical correlations are unity. It should not be forgotten, however, 
that the relation between a weighting vector of the x tests and a weight- 
ing vector of the y tests is constant. However indeterminate a pair of x 
weighting vectors may be (because their canonical roots are equal), once 
they have been stabilized, even if arbitrarily, then the position of the 
corresponding pair of y vectors is also fixed, by virtue of the equation 
linking the two. 

The value of analysis by canonical correlations in these doubtful
situations is to establish that certain relations between the two sets of
variables do occur. The nature of the relations must be determined from
considerations other than an examination of the canonical vectors.

A useful technique to aid the interpretation of an analysis is the
examination and comparison of three sets of dimensions: (a) the
canonical vectors, which relate the two sets of tests, (b) the latent
vectors of the original correlation matrices $R_{xx}$ and $R_{yy}$, and
(c) the latent vectors of these matrices after subtraction of their shared
dimensions. In the preceding example we know that all the variance of
the y variable is accounted for by regression, but we may compare the
latent vectors of $R_{xx}$ before and after subtracting the part accounted
for by y. The sum of squares of each vector $f_i$ is set equal to its
associated latent root $\lambda_i$.

             Matrix           λ₁      λ₂      f₁       f₂
Before       1.0    0.6       1.6     0.4     √0.8     √0.2
             0.6    1.0                       √0.8    −√0.2

After        0.36   0.60      1.36    0       0.6      0
             0.60   1.00                      1.0      0


[...] equivalent to the first, and only, canonical [...] elements were
found to be 1 [...] of a set of canonical correlations [...] only if the
smaller of the two sets of [...] the space of the larger. Our second
example of multiple regression satisfies this condition: the line y lies
entirely within the plane of the x variables.

The geometrical representation of a principal component analysis 
was described in chapter 5. The projection of a vector on to a plane was 
illustrated by the analogy of a gnomon casting its shadow on to a dial 
when the sun is directly overhead. The same analogy serves to illustrate 
the geometry of a multiple regression analysis. The two-dimensional 
surface of the dial is a plane whose (correlated) axes are the two pre- 
dictor tests. The gnomon is the criterion variable. It must be understood 
that all three vectors have both a positive and a negative end, and all 
three cross at a common origin, which is the centre of the dial. The 
multiple correlation coefficient is the cosine of the angle between the 
gnomon and its own shadow. This angle is the smallest possible angle 
between the criterion vector and the plane of the two predictor vari- 
ables. In arithmetic terms, it is the correlation between the criterion and 
a suitably chosen weighted combination of the predictor tests. 

It should be noted that the multiple correlation is the cosine of an 
angle between a line and a plane. The axes of the plane are irrelevant 
to the value of the correlation. R remains unchanged if the axes are 
rotated; in particular it remains unchanged if the predictor tests are 
replaced by their principal components. It must be emphasized, how- 
ever, that R may be drastically altered if any dimensions are omitted 
from the space of the predictor variables. R is the same for both the 
predictor tests and the predictor components only so long as all the 
components are retained. The size of a latent root is irrelevant to the 
size of R. If the research worker omits the component accounting for 
the smallest proportion of the predictor variance, he may in fact reduce 
R from unity to zero. The sundial analogy may make this clear. Let us 
suppose that the gnomon lies actually on the plane of the dial. R is then 
unity. If the gnomon lies along the line of the second, and smaller, of 
the two components, then omitting that component will reduce the 
plane of the dial to a single dimension, and the gnomon will have no 
projection on to the one-dimensional line of the first component. R is 
then zero. 
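
The sundial argument can be verified with a small numerical sketch,
assuming NumPy. The criterion used here is hypothetical: its
correlations with the two predictor tests are chosen so that it lies
along the second, smaller component of the predictors (it is not the
suppressor-variable example).

```python
import numpy as np

R_xx = np.array([[1.0, 0.6], [0.6, 1.0]])
# Hypothetical criterion proportional to x1 - x2, i.e. lying along the
# smaller component; these are its correlations with the two tests.
k_tests = np.array([0.4, -0.4]) / np.sqrt(0.8)


def squared_multiple_correlation(R, k):
    """R squared as the inner product k' R^-1 k."""
    return float(k @ np.linalg.solve(R, k))


# eigh returns the latent roots in ascending order: 0.4 first, then 1.6.
roots, vectors = np.linalg.eigh(R_xx)
# Covariances between each component and the criterion.
k_components = vectors.T @ k_tests

r2_tests = squared_multiple_correlation(R_xx, k_tests)
r2_all_components = squared_multiple_correlation(np.diag(roots), k_components)
# Retain only the larger component (index 1) and omit the smaller one.
r2_largest_only = k_components[1] ** 2 / roots[1]

print(r2_tests, r2_all_components, r2_largest_only)
```

With both components retained R squared is unity, exactly as for the
tests; with the smaller component omitted it falls to zero, although
that component accounts for only a fifth of the predictor variance.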

Several advantages are gained by calculating regression weights for 
uncorrelated components rather than for correlated tests. The calcula- 
tions may be carried out by simply substituting each person’s set of 
component scores for his variable scores. In the case of a canonical 
correlation analysis, each person has two sets of component scores, one 
for each set of variables. In order to retain immediate comparability 
between the regression in terms of tests and the regression in terms of 
components, the components should be the components derived from 
the appropriate correlation matrix. The partitioned matrix of the re- 
gression analysis contains variances and covariances of components. 
The matrices $R_{xx}$ and $R_{yy}$ for the analysis in terms of components
contain variances of components in the diagonal cells and covariances
in the off-diagonal cells. Since all covariances are zero, the matrices
are diagonal matrices, and since the variance of a component is its
latent root, each diagonal element is a latent root. $R_{xy}$ and $R_{yx}$
contain covariances between components of one set and components of the
other. Since the canonical roots are not affected by linear transformations
in either or both sets of variables, it is not essential to make the same
transformation of both sets. In some circumstances it may be appropriate
to relate the variables of one set to the components of the other.
If two analyses are carried out, one entirely in terms of variables and
the other entirely in terms of components, the research worker is at
liberty to report any of the four possible combinations of the two sides
of the two analyses.

The calculations in terms of components will be illustrated by a 
further analysis of the example which was used to explain the nature of 
a suppressor variable. When, as in this case, one of the sets contains 
only a single variable, that variable is standardized, and the standard 
scores are treated as scores on the one and only component of the set, 
which has a latent root of unity. The partitioned variance-covariance
matrix of the components is

$$\left(\begin{array}{cc|c}
1.6 & 0 & 0.8\sqrt{0.5} \\
0 & 0.4 & 0.8\sqrt{0.5} \\ \hline
0.8\sqrt{0.5} & 0.8\sqrt{0.5} & 1.0
\end{array}\right)$$


The reader is not in a position to check the calculation of this matrix
since he does not have available the test scores from which the
original correlations were derived. He therefore cannot calculate
the component scores. It is enough, however, to have the original
partitioned correlation matrix in terms of tests. Both the multiple
regression analysis and the principal component analysis are derived
from this matrix, and the one may be expressed in terms of the other
without recourse to the original data (Kendall, 1957, p. 71).

The new matrix $R_{xx}$ is particularly easy to invert because it is a
diagonal matrix. The inverse is obtained by replacing each diagonal
element of $R_{xx}$ by its reciprocal. The multiple regression analysis
proceeds in the usual way. The value of $R^2$ is the same as before
because we have simply changed to an alternative set of reference axes,
without altering the predictor space in any way. The relative
contributions of the components are given in the product column.

Inverse of R_xx         k            b            Product
1/1.6     0             0.8√0.5      0.5√0.5      0.2
0         1/0.4         0.8√0.5      2.0√0.5      0.8
                                          R² =    1.0

Although these products are coefficients of separate determination, they
do not have any of the disadvantages which such coefficients often possess
(see chapter 10). The determination is in each case truly separate because
the components are orthogonal to one another.
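
The passage from tests to components requires only the partitioned
correlation matrix. A minimal sketch, assuming NumPy, follows; the
names are illustrative. The signs of the latent vectors returned by
the routine are arbitrary, so individual elements of k and b may come
out with reversed signs, but the products are unaffected.

```python
import numpy as np

R_xx = np.array([[1.0, 0.6], [0.6, 1.0]])   # correlations between the predictor tests
r_xy = np.array([0.8, 0.0])                 # correlations of the tests with the criterion

# Latent roots and vectors of R_xx, rearranged so that the largest root comes first.
roots, vectors = np.linalg.eigh(R_xx)
order = np.argsort(roots)[::-1]
roots, vectors = roots[order], vectors[:, order]

# Covariances between the components and the (standardized) criterion.
k = vectors.T @ r_xy
# The component matrix is diagonal, so inversion is division by the roots.
b = k / roots
products = k * b
r_squared = products.sum()

print("latent roots:", roots)
print("products:", products)
print("R squared:", r_squared)
```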

The criterion variable may be referred to the components of the pre- 
dictor variables by means of a polar co-ordinate analysis, with its 
attendant advantages. In order to calculate the necessary angles, each
element of b must be multiplied by the square root of its associated
latent root. The first element becomes
$0.5\sqrt{0.5} \times \sqrt{1.6} = \sqrt{0.2}$ and the second element
becomes $2\sqrt{0.5} \times \sqrt{0.4} = \sqrt{0.8}$. The cosine of the
angle between the criterion variable and the second component, within the
plane of the first two components of the predictor space, is

$$\frac{\sqrt{0.8}}{\sqrt{0.8 + 0.2}} = 0.8944$$

which gives an angle of 26°34′. The angle between the criterion variable
and the first component is therefore 90° − 26°34′ = 63°26′. We know
that, in this particular example, the angle between the criterion and its 
projection in the predictor space is zero, because the multiple correlation 
is perfect. In other words, all three variables lie in a two-dimensional 
space. We have, in various tables, reported 
(a) the cosines of the angles between the tests (the partitioned cor- 
relation matrix), 
(b) the cosines of the angles between predictor tests and components 
(the matrix of latent vectors), 
(c) the angles between the criterion variable and the components of 
the predictor tests (immediately above); 
and we know 
(d) that the components must be uncorrelated with one another. 
The whole set of cosines is set out in the matrix below. By translating
the cosines into angles and constructing a rough diagram, the reader
may verify that the angles form a self-consistent set on the plane of a
sheet of paper. The predictor tests are labelled p₁ and p₂, the criterion
is q, and the components are f₁ and f₂. In order to make the elements of
each f into cosines, its sum of squares has been set equal to the
associated latent root.

        p₁        p₂      |  q         f₁        f₂
p₁      1.0000    0.6000  |  0.8000    0.8944    0.4472
p₂                1.0000  |  0         0.8944   −0.4472
q                            1.0000    0.4472    0.8944
f₁                                     1.0000    0
f₂                                               1.0000
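
The self-consistency of these cosines may also be checked by
arithmetic instead of by diagram. The sketch below uses only the
Python standard library; the helper names are illustrative. Each
variable is described by its co-ordinates on the two components, and
the cosines between variables are recovered as inner products.

```python
import math

roots = [1.6, 0.4]
b = [0.5 * math.sqrt(0.5), 2.0 * math.sqrt(0.5)]   # regression weights on the components

# Co-ordinates of the criterion on the components: b times the root's square root.
q = [weight * math.sqrt(root) for weight, root in zip(b, roots)]
length = math.hypot(q[0], q[1])
cosines = [coordinate / length for coordinate in q]


def degrees_and_minutes(cosine):
    """Convert a cosine to an angle in whole degrees and rounded minutes."""
    angle = math.degrees(math.acos(cosine))
    whole = int(angle)
    return whole, round((angle - whole) * 60)


def inner_product(u, v):
    """Cosine between two vectors of unit length."""
    return sum(ui * vi for ui, vi in zip(u, v))


angle_f1 = degrees_and_minutes(cosines[0])
angle_f2 = degrees_and_minutes(cosines[1])

# Co-ordinates of the predictor tests (rows of the matrix of latent vectors).
p1 = [math.sqrt(0.8), math.sqrt(0.2)]
p2 = [math.sqrt(0.8), -math.sqrt(0.2)]

print("angle of criterion with f1:", angle_f1)
print("angle of criterion with f2:", angle_f2)
print("cos(p1, q):", inner_product(p1, cosines))
print("cos(p2, q):", inner_product(p2, cosines))
print("cos(p1, p2):", inner_product(p1, p2))
```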


In the case of a canonical correlation analysis, every canonical vector 
may be referred to the components of each set of variables, and the two 
sets may be plotted on two spherical maps. Canonical vectors which are 
associated with small and non-significant canonical roots will no doubt 
be omitted from the plot. The choice of components to enter into a map 
will be dictated by their importance in the retained canonical vectors 
rather than by the size of their latent roots. 

Let us now stand back from the calculations and try to visualize the
geometry of a canonical correlation or multiple regression analysis. We
have two sets of variables, one consisting of x variables and the other
consisting of y variables. If we let x and y stand for the number of
variables in each of the respective sets, then we have x + y variables in
all. In general, these two sets will yield q canonical variates, where q is
equal to the smaller of x and y. But each canonical variate has a dual
representation; we have, for example, the scores of wives and also the
scores of husbands on the first variate, and similarly for each subsequent
variate. It is an easy matter to correlate each of the x + y variables
with each of q + q columns of canonical variate scores. These
correlations are of interest in themselves when we are trying to interpret
the variates in terms of the variables.

The interpretation may be carried two stages further. In the first stage
we calculate the latent vectors of the space of the standardized x
variables, and use them to define a set of orthogonal axes (principal
components) of the x space. Each variable may be represented by a vector
in this space. Chapters 4 and 5 showed how to calculate the projection
of a criterion variable upon a whole set of principal components or upon
any subset. In a very similar way each canonical variate may be
represented as a pair of vectors in the x space, and the positions of
these vectors may be plotted. All we need to know is the correlation or
cosine between each vector and each principal component. The calculation
of the polar co-ordinate representation of a canonical vector does not
differ in principle from the calculation of the representation of a
variable. Calculations analogous to those for the x space may be carried
out for the y space.

The second stage of the geometrical analysis follows from the fact 
that the variables may be considered as defining a single, overall space 
of x + y dimensions. Each canonical variate has a dual representation 
in this space, and the two vectors may be plotted from the correlations 
or cosines between the pair of canonical variate scores and the com- 
ponents of the whole space. The analysis is formally identical with the 
separate analyses of the x space and the y space.

Thus regression vectors may be mapped into three different spaces, 
or into any subspace of any of the three. Two of the three spaces con- 


stitute overlapping parts of the third space, which is the space of the 
whole analysis. 


Sample size 


Throughout this book multivariate techniques are illustrated by analyses 
of very small sets of data. It should be obvious that in practice, complex 
techniques should not be applied to small samples of persons. There is 
no general rule, however, to indicate how large a sample must be if it is 
to be subjected to multivariate analysis. Minimum appropriate sample 
size depends on the purpose which the conclusions of the analysis are 
to serve, as well as on the mathematical properties of different sizes of 
matrix. It is possible to set a lower bound to sample size by calculating 
the maximum dimensionality of the data. If a canonical correlation 
analysis is carried out on two sets of variables, each with 10 members, 
then the 20 x 20 correlation matrix has a maximum of 20 positive 
latent roots, which means that 20 dimensions may be required to ac- 
count for the data. If, however, the sample from which the correlations 
are calculated contains only 16 persons, the matrix cannot have more 
than 15 positive roots, since a collection of 16 points cannot generate 
a space of more than 15 dimensions. A canonical correlation analysis 
would therefore necessarily show that the two spaces, one associated 
with each set of variables, have at least 20 — 15 = 5 dimensions in 
common. In other words, five of the canonical correlations would in- 
evitably be unity. It is arguable that an analysis suffering from this kind 
of constraint may be worth carrying out in certain circumstances. 
Nevertheless, the research worker should realize why it is that he 
achieves such apparently satisfactory canonical roots. If, in the example,
the number of persons fell to 10 or below, the analysis could not be
carried out in the ordinary way because the matrices $R_{xx}$ and
$R_{yy}$ would be singular and could not be inverted; that is, they would
have at least one zero latent root. It would be possible to replace scores
on the variables by scores on the components and to carry out the
analysis on the component
scores. While such an analysis is often of value, it may prove unprofit- 
able if it is resorted to simply because the sample is small. 
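
The constraint imposed by a small sample can be seen in a simulation.
The sketch below assumes NumPy and uses purely random scores for 16
hypothetical persons on two sets of 10 variables. In place of the
correlation-matrix formulae it computes the canonical correlations as
the singular values of the product of orthonormal bases for the two
centred score matrices, an equivalent route chosen here for numerical
convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 16, 10, 10
X = rng.standard_normal((n, p))   # random scores on the first set
Y = rng.standard_normal((n, q))   # random scores on the second set


def canonical_correlations(X, Y):
    """Canonical correlations via orthonormal bases of the centred score matrices."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values are the cosines of the angles between the two spaces.
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)


r = canonical_correlations(X, Y)
n_perfect = int(np.sum(r > 1 - 1e-8))
print(np.round(r, 4))
print("canonical correlations equal to unity:", n_perfect)
```

Although the two sets of scores are unrelated, 20 − 15 = 5 of the
canonical correlations are forced to unity by the size of the sample
alone.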

Readers should perhaps be reminded that the purpose of most statistical
analyses is to find out something about a collection of persons or
things which is wider than the collection actually entering into the
analysis. If generalizations are to be made from the sample to its
population, it must of course be a random sample.
Chapter 7 Discriminant Function Analysis


The discriminant function was introduced by R. A. Fisher in 1936 as a
statistical technique to facilitate the classification of things or persons.


The case of two groups 


We are given two classes of objects or persons. t measurements have 
been made on each person, and we wish to find the weighting vector 
which, when applied to some newly observed and unclassified person, 
will assign him to one or other of the classes with the smallest prob- 
ability of error. We assume that, in the populations from which the 
classes are drawn, the t variables have a common multivariate normal 


distribution. The vector of weights (w) which provides the optimum
assignment is given by

$$w = V^{-1}d$$

where d is the vector of differences between the t pairs of means of the
two classes, and V is the weighted average of the dispersion matrices of
the two classes.

The analogy with the multiple regression equation 


$$b = R^{-1}k$$


should be evident. Both w and b are vectors of weights. The vector of 
differences between group means on the variables (d) is analogous to 
the vector of correlations between the predictor variables and the cri- 
terion variable (k). The exact nature of the analogy is illustrated in the 
following chapter. 

Let us suppose that we wish to advise an undergraduate whether an 
Arts or a Science course may be most suitable for him, and the informa- 
tion on which the advice must be based is derived from two aptitude 
tests which he completes on entering university. We take the admission 
test scores of three successful Arts graduates and three successful Science 
graduates (the numbers in the two groups need not be equal) as the basis
for our advice. The graduates obtained the following scores on the two
tests, X and Y:

Arts
          X     Y
P₁        3     6
P₂        4     4
P₃        5     8
Mean      4     6

Science
          X     Y
P₄        1     2
P₅        2     0
P₆        3     4
Mean      2     2

We calculate the sums of squares and sums of products matrix of each
group:

        Arts          Science        Sum = W
        X     Y       X     Y        X     Y
X       2     2       2     2        4     4
Y       2     8       2     8        4    16

The dispersion matrices of the two groups (i.e. the sums of squares and
sums of products matrices divided by one less than the number of
persons in a group) are identical. Discriminant function analysis assumes
that the dispersion matrices of the two groups are both estimates of a
common dispersion matrix. If the two dispersion matrices differ
appreciably, the analysis cannot be carried out. The appropriate test of
significance is the test of homogeneity described in chapter 2. We
calculate the weighted average of the dispersion matrices by summing
the sums of squares and sums of products matrices to obtain matrix W
and dividing W by $\sum (n_g - 1)$, where $n_g$ is the number of persons
in the gth group. We obtain the weighted average within group dispersion
matrix V

$$V = \frac{1}{4} W = \begin{pmatrix} 1 & 1 \\ 1 & 4 \end{pmatrix}$$

The differences between the means of the two groups (let us say, Arts
minus Science, it does not matter which order we take them in) are +2
on test X and +4 on test Y. We calculate the inverse of V and multiply
it by the vector of mean differences d.


        d     w       Product
X       2     4/3     8/3
Y       4     2/3     8/3
              D² = d′V⁻¹d = 16/3

Just as we calculated R² as the inner product of k and b, so we may
calculate a quantity known as Mahalanobis' D² as the inner product of
d and w. The significance of D² [...]

$$(3 \times \tfrac{4}{3}) + (6 \times \tfrac{2}{3}) = 8$$

The scores of all six persons are

        Arts               Science
P₁      8           P₄     2 2/3
P₂      8           P₅     2 2/3
P₃      12          P₆     6 2/3
Mean    9 1/3              4


The within group sum of squares of these two sets of scores is 21 1/3.
The within group mean square is this value divided by the 4 within group
degrees of freedom. Thus the within group mean square is equal to D².

It is convenient to choose weights w such that the weighted average
within group mean square of the discriminant function scores is unity.
To achieve this, we divide each element of w by D = 2.3094. The vector
of weights is then

Test    w
X       4/(3D)
Y       2/(3D)
When we apply these modified weights to the original test scores, we
obtain the following standardized discriminant function scores:


        Arts               Science
P₁      3.4641      P₄     1.1547
P₂      3.4641      P₅     1.1547
P₃      5.1962      P₆     2.8868
Mean    4.0415             1.7321


In addition to having unit within group variances, and hence unit within 
group standard deviations, these scores have the further property that 
the difference between the two group means is equal to D, 


4.0415 − 1.7321 = 2.3094


Mahalanobis' D is therefore the distance between two groups on a
dimension which has unit standard deviation within the groups. The 
actual standard deviation in any one group will, in practice, rarely be 
exactly unity. It is the average of the mean squares, with each mean 
square weighted by the number of persons in its group, which is exactly 
one. Because the groups are assumed to have a common dispersion on 
the original variables, the deviations from unity on their discriminant 
dimension are attributed to sampling error. 
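
The whole calculation, from the raw scores to the standardized
discriminant function scores, can be retraced as follows. This is a
sketch assuming NumPy; the function and variable names are
illustrative.

```python
import numpy as np

# Scores of the six graduates on tests X and Y.
arts = np.array([[3, 6], [4, 4], [5, 8]], dtype=float)
science = np.array([[1, 2], [2, 0], [3, 4]], dtype=float)


def sums_of_squares_and_products(group):
    """Sums of squares and products of deviations from the group means."""
    deviations = group - group.mean(axis=0)
    return deviations.T @ deviations


W = sums_of_squares_and_products(arts) + sums_of_squares_and_products(science)
V = W / ((len(arts) - 1) + (len(science) - 1))   # weighted average dispersion matrix
d = arts.mean(axis=0) - science.mean(axis=0)     # Arts minus Science
w = np.linalg.solve(V, d)                        # w = V^-1 d
D_squared = float(d @ w)
D = np.sqrt(D_squared)

# Standardized scores: apply the weights and divide by D.
arts_scores = arts @ w / D
science_scores = science @ w / D

print("w:", w, " D squared:", D_squared, " D:", D)
print("Arts scores:", np.round(arts_scores, 4))
print("Science scores:", np.round(science_scores, 4))
print("difference between means:", arts_scores.mean() - science_scores.mean())
```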

In order to use our research to give advice to an undergraduate, we 
should administer the aptitude tests X and Y to him and then calculate 
his score on the discriminant function. If this score exceeds a certain 
cutting score, we should advise him to take an Arts course; otherwise 
we should advise a Science course. For some purposes the appropriate 
cutting score is the half-way mark between the means of the two groups. 
Half the distance between the means is 


$$\tfrac{1}{2}D = 1.1547$$

and so the half-way mark is

$$4.0415 - 1.1547 = 2.8868$$

Because the standard deviations within the groups are unity, ½D is a
unit normal deviate. If we look it up in a table of the normal curve, we
find the probability that a person who really belongs to one group will
be incorrectly assigned to the other by the discriminant function. In our
example the probability of misclassification is 0.13, which means that
we may expect 87 per cent of our assignments to be correct.
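
The table of the normal curve may be replaced by the complementary
error function of the Python standard library. The sketch below is
illustrative only; the helper name `normal_upper_tail` is an
assumption, and the group means are those of the standardized scores.
The tail area computed in this way is close to 0.124, slightly below
the rounded figure of 0.13 quoted from the table.

```python
import math


def normal_upper_tail(z):
    """Probability that a unit normal deviate exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


D = math.sqrt(16.0 / 3.0)                 # Mahalanobis' D for the example
arts_mean, science_mean = 4.0415, 1.7321  # standardized discriminant means

cutting_score = (arts_mean + science_mean) / 2.0
p_misclassified = normal_upper_tail(D / 2.0)

print(f"half-way mark: {cutting_score:.4f}")
print(f"probability of misclassification: {p_misclassified:.3f}")
```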

We may, of course, wish to suspend judgment when asked to assign 
persons who fall near the half-way mark of the discriminant function. 
If we decide not to pass judgment on anyone who has a 10 per cent
chance of belonging to group A although he lies nearer to B than to A,
we set aside the appropriate stretch of the discriminant function as a
no-man's-land. The 10 per cent level of the unit normal deviate is
1.2816. Any person who falls above 4.0415 − 1.2816 = 2.7599 has at
least a one in ten chance of belonging to the Arts group. A student who
scores 2.80 on the discriminant function is nearer to the Science mean
than to the Arts mean, but he still has at least a one in ten chance of
belonging to the Arts group. He falls within the no-man's-land. We may
decide not to commit ourselves to a decision or, if we do, to make only
a tentative assignment. A similar suspension of judgment may be made
if a student lies closer to the Arts group than to the Science group, but
within 1.2816 of the mean of the Science group.

The half-way mark on the discriminant function may not be the most 
appropriate or efficient cutting point. If one group occurs more fre- 
quently than the other in the population from which we are drawing 
members, then the cut-off point should be moved towards the mean of 
the less frequent group. The probability of misclassifying members of 
the infrequent group is thereby increased, and the probability of mis- 
classifying members of the more frequent group is decreased. By a 


suitable choice of cutting point, the overall probability of misclassifica- 
tion may be minimized. 


Let us suppose that the ratio of successful Arts graduates to successful
Science graduates (in the population from which our random sample of
assignees has been drawn) is 2:1. We divide the larger of these two
numbers by the smaller and take the natural logarithm,

$$\log_e 2 = 0.6932$$

We divide this value by D,

$$\frac{0.6932}{2.3094} = 0.3002$$

We move our cutting point this distance towards the Science group,

$$2.8868 - 0.3002 = 2.5866$$


The unit normal deviate for the Arts group is now 4.0415 − 2.5866
= 1.4549. The probability of wrongly classifying an Arts person as a
Scientist is now only 0.07 or 7 per cent. The unit normal deviate for the
Science group is 2.5866 − 1.7321 = 0.8545, which gives a
misclassification rate of 0.20 or 20 per cent. Since two-thirds of our
population are Arts people and one-third are Scientists, the estimated
overall rate of misclassification, with a cut-off point of 2.5866, is

$$(0.07 \times \tfrac{2}{3}) + (0.20 \times \tfrac{1}{3}) = 11\%$$
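
The effect of moving the cutting point can be checked in the same
way. The sketch below uses only the standard library; the names are
illustrative, and the tail areas are computed exactly, so they differ
slightly from the rounded table values of 0.07 and 0.20. For
comparison it also evaluates the overall error rate at the half-way
mark.

```python
import math


def normal_upper_tail(z):
    """Probability that a unit normal deviate exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


D = math.sqrt(16.0 / 3.0)
arts_mean, science_mean = 4.0415, 1.7321
prior_arts, prior_science = 2.0 / 3.0, 1.0 / 3.0   # population ratio 2:1

half_way = (arts_mean + science_mean) / 2.0
shift = math.log(prior_arts / prior_science) / D
cutting_point = half_way - shift                    # moved towards the Science mean

error_arts = normal_upper_tail(arts_mean - cutting_point)
error_science = normal_upper_tail(cutting_point - science_mean)
overall = prior_arts * error_arts + prior_science * error_science
overall_at_half_way = normal_upper_tail(D / 2.0)

print(f"cutting point: {cutting_point:.4f}")
print(f"Arts misclassified: {error_arts:.3f}, Science misclassified: {error_science:.3f}")
print(f"overall rate: {overall:.3f} (at the half-way mark: {overall_at_half_way:.3f})")
```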


In some circumstances the relative costs of errors must be taken into 
account in the determination of the optimum cutting point. Relative 
costs or risks may be dealt with in the same way as relative frequencies. 

The decision to employ a discriminant function for purposes of classi- 
fication implies a prior decision that the person to be classified belongs 
to one or other of the two groups. The prior decision may be put to the 
test by calculating the chance that the person belongs to the first group 
and the chance that he belongs to the second group. If neither of these 
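probabilities exceeds, say, 0.05 we may decide that the discrimination
is irrelevant to that person. Let us suppose that a student scores 2 on X
and 8 on Y. We express these scores as deviations from the mean of the
Arts group, and call the vector of deviations $d_1$.

$$d_1 = \begin{pmatrix} 2 - 4 \\ 8 - 6 \end{pmatrix} = \begin{pmatrix} -2 \\ +2 \end{pmatrix}$$

We calculate $\chi^2 = d_1' V_1^{-1} d_1$ where $V_1$ is the dispersion
matrix of the Arts group. In a problem such as this, where the two
populations are assumed to have a common dispersion matrix which is
estimated by V, we may substitute $V^{-1}$ for $V_1^{-1}$.

$$\begin{pmatrix} -2 & +2 \end{pmatrix}
\begin{pmatrix} \tfrac{4}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}
\begin{pmatrix} -2 \\ +2 \end{pmatrix} = 9.33$$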


This chi square has the value 9.33, with degrees of freedom t = 2. The
probability that the student belongs to the Arts group is therefore less
than 0.01. We perform exactly the same calculation on his deviations
from the mean of the second group, $d_2$. The reader may verify that chi
square for the Science group is 12.00, p < 0.005. The student does not
appear to be a candidate for either an Arts or a Science career, and
there is therefore no purpose in asking which group he most resembles.
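The two chi squares may be verified with a few lines of Python. This is a sketch under stated assumptions: the inverse dispersion matrix and the Science means (2, 2) are not printed in this passage and are inferred from the W and B matrices of the chapter's example; the helper names are invented here. With two degrees of freedom the upper-tail probability of chi square has the simple closed form exp(−χ²/2), which is used in place of tables.

```python
import math

# Inverse of the pooled dispersion matrix V = [[1, 1], [1, 4]] (assumed from W / 4).
V_INV = [[4 / 3, -1 / 3],
         [-1 / 3, 1 / 3]]

def chi_square(score, mean, v_inv):
    """Quadratic form d' V^-1 d for the deviation d = score - mean."""
    d = [s - m for s, m in zip(score, mean)]
    vd = [sum(v_inv[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return sum(di * vdi for di, vdi in zip(d, vd))

def p_value_2df(chi2):
    """Upper-tail probability of chi square with exactly 2 degrees of freedom."""
    return math.exp(-chi2 / 2.0)

student = [2, 8]
arts_mean, science_mean = [4, 6], [2, 2]   # Science means are an inferred assumption

chi_arts = chi_square(student, arts_mean, V_INV)
chi_science = chi_square(student, science_mean, V_INV)

print(round(chi_arts, 2), p_value_2df(chi_arts) < 0.01)
print(round(chi_science, 2), p_value_2df(chi_science) < 0.005)
```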

If, however, we were to calculate his score on the standardized discriminant
function, we should find that it was 3.4641, which is only
about half a unit normal deviate from the mean of the Arts group. The
discriminant function is a line joining the means of the two groups.
When the student is projected on to the one-dimensional space of this
line, he falls within the Arts group. When, however, he is allowed to
take up his position in the two-dimensional space of the original tests,
he is seen to be at a considerable distance from the means of both
groups.

At some point in a discriminant function analysis it is customary to
perform an analysis of dispersion, that is, to test the multivariate dif- 
ferences between the means of the two groups to see whether we are 
correct in supposing that they are not drawn from one and the same 
population. This may seem a nonsensical test, since, before embarking 
on a discriminant function analysis, we almost invariably know that our 
groups are drawn from different populations. But the population re- 
ferred to in the null hypothesis of the analysis of dispersion is a popula- 
tion of measures or characteristics of persons, not a population of per- 
sons qua persons. The set of characteristics which we have chosen for 
the purpose of discriminating the two groups may include no measures 
on which they actually differ. 


In chapter 2 the analysis of dispersion involved the calculation of

\[ L = \frac{|W|}{|T|} \]

where T is the total sums of squares and sums of products matrix,

\[ T = \begin{bmatrix} 10 & 16 \\ 16 & 40 \end{bmatrix} \]

and so

\[ L = \frac{48}{144} = \frac{1}{3} \]

The test of significance described in chapter 2 was an approximate test,
but an exact test is available for the case of two groups, and the variance
ratio is

\[ F = \frac{1 - L}{L} \cdot \frac{n_1 + n_2 - t - 1}{t} \]

The degrees of freedom are t and $(n_1 + n_2 - t - 1)$, that is, 2 and 3.
The variance ratio may be regarded as a test of the significance of a
discriminant function. The effectiveness of a function may be quantified
by calculating the square of a multiple correlation coefficient R. The
square of a correlation coefficient can be considered as a ratio of two
variances. The value of D² can be converted into an R² by calculating
the ratio of (a) the variance due to the linear regression of the differences
between the groups, to (b) the total variance. The formula is

\[ R^2 = \frac{\phi}{1 + \phi} \]

where $\phi$ (phi) is

\[ \phi = \frac{n_1 n_2}{(n_1 + n_2)(n_1 + n_2 - 2)} \times D^2 \]

The value of $\phi$ is

\[ \phi = \frac{(3)(3)}{(3 + 3)(3 + 3 - 2)} \times \frac{16}{3} = 2 \]

\[ R^2 = \frac{2}{1 + 2} = 0.67, \qquad R = 0.82 \]


The quantity $\phi$, which appears in a more general form in the section
on the canonical analysis of discriminance, is closely related to the
variance ratio, in that

\[ F = \phi \, \frac{n_1 + n_2 - t - 1}{t} = 2 \times \frac{3 + 3 - 2 - 1}{2} = 3.00 \]
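The relations among L, the variance ratio, φ and R² can be confirmed on the chapter's figures. The sketch below is illustrative rather than definitive: W is taken from the chapter's example, D² = 16/3 is assumed from the quoted D of 2.3094, and the variable names are chosen here for convenience.

```python
W = [[4.0, 4.0], [4.0, 16.0]]     # within groups sums of squares and products
T = [[10.0, 16.0], [16.0, 40.0]]  # total sums of squares and products
n1, n2, t = 3, 3, 2               # group sizes and number of tests
D2 = 16 / 3                       # squared distance between groups (assumed)

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

L = det2(W) / det2(T)
F_from_L = (1 - L) / L * (n1 + n2 - t - 1) / t

phi = n1 * n2 / ((n1 + n2) * (n1 + n2 - 2)) * D2
R2 = phi / (1 + phi)
F_from_phi = phi * (n1 + n2 - t - 1) / t

print(f"L = {L:.4f}, F = {F_from_L:.2f}")
print(f"phi = {phi:.2f}, R squared = {R2:.4f}, R = {R2 ** 0.5:.2f}")
print(f"F recovered from phi = {F_from_phi:.2f}")
```

Both routes lead to the same variance ratio because, in the two-group case, L = 1/(1 + φ).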


$\phi$ is actually the ratio of between to within sums of squares of scores on
the discriminant function. The within group mean square of these scores
was set to unity, and so the within sum of squares must be one multiplied
by the within degrees of freedom, which are $\sum (n_j - 1) = 4$. The
deviation of both group means from the overall mean is ½D, and so the
between groups sum of squares is

\[ \sum n_j \left( \tfrac{1}{2} D \right)^2 = 8 \]

which gives

\[ \phi = \frac{8}{4} = 2 \]

110 Methods of Multivariate Analysis 


In addition to estimating the significance, power and classificatory 
efficiency of a discriminant function, it is useful to estimate its precision 
(Fisher, 1940). Calculating the precision of a discriminant function is 
analogous to setting confidence limits. In effect, we set limits to the 
weighting vector w. We do not, however, calculate confidence limits or 
standard errors for each individual element of w. Rather, we calculate 
the minimum within group correlation between our obtained discrimi- 
nant function and any other function which might purport to be equi- 
valent to the obtained function, within the limits of error. We must 
decide what limits to set. Let us suppose that we decide to work at the
10 per cent level; that is, we decide to reject at the 10 per cent level any
discriminant function whose correlation with the obtained discriminator
within the groups is less than r. We begin our calculation of r by looking
up F with degrees of freedom $t - 1$ and $n_1 + n_2 - t - 1$. In the 10 per
cent table of F, F with degrees of freedom 1 and 3 is 5.54. Multiplying
this variance ratio by the ratio of its degrees of freedom gives us a ratio
of sums of squares at the 10 per cent level.

\[ \phi_{10} = \frac{(t - 1)F}{n_1 + n_2 - t - 1} = \frac{5.54}{3} = 1.85 \]

From this value of $\phi_{10}$ we calculate the equivalent value of R²:

\[ R^2_{10} = \frac{\phi_{10}}{1 + \phi_{10}} = \frac{1.85}{2.85} = 0.6491 \]

The ratio of the 10 per cent level of R² to the obtained value of R² is a
measure of the tolerable deviation from correlation within samples.

\[ r = 0.17 \]

Any discriminant function which is not significantly different from the
obtained function must correlate with it to the extent of 0.17 or more.
The low value of r is an indication that virtually no reliance can be
placed on the obtained elements of w as guides to the true nature of the
distinction between the two groups.


Multiple discriminant analysis 


The technique of discriminant function analysis has been generalized to 
the case of more than two groups (Fisher, 1938; Rao, 1948; Rao and 
Slater, 1949). The general method may be illustrated by applying it to 
the two-group case considered in the preceding section. Instead of cal- 
culating a single vector of weights, we now calculate a vector of weights 
for each group entering into the analysis. In order to do this, we replace
the vector d by the vector of means of each group in turn ($m_i$). We shall
represent the vectors of weights by $l_i$ to distinguish them from the
weights (w) for the two-group case. For group one we calculate

\[ l_1 = V^{-1} m_1 \]

The matrix $V^{-1}$ and the vector of means for the first group are given in
the preceding section and are

\[ V^{-1} = \begin{bmatrix} \tfrac{4}{3} & -\tfrac{1}{3} \\ -\tfrac{1}{3} & \tfrac{1}{3} \end{bmatrix},
\qquad m_1 = \begin{bmatrix} 4 \\ 6 \end{bmatrix},
\qquad l_1 = \begin{bmatrix} \tfrac{10}{3} \\ \tfrac{2}{3} \end{bmatrix} \]

In addition, we calculate a constant which is half of the inner product of
$m_i$ and $l_i$.

\[ \tfrac{1}{2} m_1' l_1 = \tfrac{26}{3}, \qquad
l_2 = V^{-1} \begin{bmatrix} 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \qquad
\tfrac{1}{2} m_2' l_2 = 2 \]

Let us suppose that we have the scores of a person on the two variables,
and that we wish to decide to which of the k groups (here k = 2) we
should assign him. Let us suppose that his scores are 2 and 5. We first
apply the weights $l_1$ to these scores and subtract the constant $\tfrac{1}{2} m_1' l_1$;

\[ 2(\tfrac{10}{3}) + 5(\tfrac{2}{3}) - \tfrac{26}{3} = \tfrac{4}{3} \]

We then apply the weights $l_2$ to the same scores and subtract the constant
$\tfrac{1}{2} m_2' l_2$

\[ 2(2) + 5(0) - 2 = 2 \]

We assign the person to the group for which he obtains the highest
weighted score, that is, in this case, to group two, since 2 > 4/3.

However, this method assumes that the a priori probability of the 
person falling into a group is equal to 1/g for every group; in other
words, it assumes that we have no reason to think that he is more likely 
to fall into one group rather than into another. But to assume this is 
often to neglect useful a priori information. If we know, for example, 
that success in Science is less frequent than success in Arts, and if we 
can put a probability value on this knowledge, then we can refine our 
criterion of assignment. Suppose we know that 80 per cent of students 
can make good Arts graduates, while only 20 per cent can make good 
Science graduates. Then we add to a person’s weighted score for a particular
group the value $\log_e p$, where p is the a priori likelihood of a
student belonging to that group. For the Arts group p is 0.80 and the
natural logarithm of 0.80 is −0.2232. For the Science group p is 0.20
and the natural logarithm of 0.20 is −1.6094. The scores of our hypothetical
person, when corrected for a priori probabilities, are


Group one    1.3333 − 0.2232 = 1.1101
Group two    2.0000 − 1.6094 = 0.3906


Since this person’s Arts score is now higher than his Science score, we 
shall advise him to take an Arts course. 
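The assignment rule, with and without the correction for a priori probabilities, may be sketched in Python as follows. The code is a hedged illustration: the names `classification_function` and `group_scores` are invented here, and the Science means (2, 2) and the inverse dispersion matrix are assumed from the chapter's example rather than quoted from this passage.

```python
import math

V_INV = [[4 / 3, -1 / 3],
         [-1 / 3, 1 / 3]]                     # assumed inverse dispersion matrix
MEANS = {"Arts": [4, 6], "Science": [2, 2]}   # Science means are assumed

def classification_function(mean, v_inv):
    """Weights l = V^-1 m and the constant (1/2) m'l for one group."""
    size = len(mean)
    l = [sum(v_inv[i][j] * mean[j] for j in range(size)) for i in range(size)]
    const = 0.5 * sum(m * w for m, w in zip(mean, l))
    return l, const

def group_scores(x, priors=None):
    """Weighted score of person x for every group; log priors are added if given."""
    scores = {}
    for name, mean in MEANS.items():
        l, const = classification_function(mean, V_INV)
        score = sum(xi * li for xi, li in zip(x, l)) - const
        if priors is not None:
            score += math.log(priors[name])
        scores[name] = score
    return scores

person = [2, 5]
plain = group_scores(person)
adjusted = group_scores(person, {"Arts": 0.80, "Science": 0.20})

print("equal priors:  ", max(plain, key=plain.get), plain)
print("80:20 priors:  ", max(adjusted, key=adjusted.get), adjusted)
```

Under equal priors the person is assigned to Science; once the 80:20 priors are added the Arts score becomes the larger, in agreement with the worked figures.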

A good way to summarize a discriminant analysis is to draw up a 
contingency table. The rows of the table give the actual group to which 
a person belongs, and the columns give the group to which the analysis 
assigns him. The contingency table for the six persons in our example, 
without correction for a priori probabilities, is 


                      Predicted group
                      Arts    Science
Actual     Arts         3        0
group      Science      1        2

The one person who appears to have been incorrectly assigned actually
has the same discriminant score for both groups. He is the last person 
in the Science group. His score on the two-group discriminant function 
was exactly half-way between the Arts mean and the Science mean. 
Exact ties like this do not occur very often in practice. When they do, 
some arbitrary decision has to be made when the person is entered in 
the contingency table. 


Canonical analysis of discriminance 


Discriminant function analysis assumes that each of the samples is 
drawn from a separate population, and analysis of dispersion puts this 
assumption to the test. Both discriminant function analysis and analysis 
of dispersion assume that the populations, although they may differ in 
their means, do not differ in their dispersions. If we imagine each test as 
one axis in a rectangular set of axes, then every person may be repre- 
sented as a single point whose co-ordinates are his scores on the tests. 
The mean of a sample is also a single point, and is the centre of gravity 
of the person-points belonging to the sample.

Multivariate analysis assumes that the distribution of members of a
population about the population mean is multivariate normal, from 
which it follows that by joining together areas of equal density (chapter 
4) we should obtain an ellipsoidal surface, whose centre is the popula- 
tion mean. Thus every sample may be represented as an ellipse if there 
are two test axes, or as an ellipsoid if there are more than two test axes. 
It is assumed that the ellipsoids do not differ significantly from one 
another. This assumption may be broken down into the three assump- 
tions: the ellipsoids or ellipses (a) are the same shape, i.e. the lengths of 
the principal axes are in a constant ratio to one another in all samples, 
(b) are the same size, i.e. corresponding axes are the same length in all 
samples, and (c) lie in the same orientation in the test space, i.e. the
angle between a particular principal axis and the tests is constant from 
sample to sample. Assumption (b) is a stronger version of assumption 
(a). In some circumstances (b) may be abandoned, provided that (a) is 
satisfied. 

If there are no correlations among the tests, the ellipsoids can be 
reduced to spheres simply by standardizing the variables. In the usual 
case of correlated tests the reduction of the ellipsoids to spheres involves 
division by the within group dispersion matrix, which is equivalent to 
multiplication by the reciprocal of this matrix. The canonical analysis 
of discriminance is a transformation of the original test space which 
reduces the within sample ellipsoids to spheres. For the sake of convenience
the radius of a sphere, which is the standard deviation within
samples, is set equal to unity. Two-group discriminant function analysis
is a special case of canonical analysis of discriminance, just as multiple
regression analysis with one criterion variable is a special case of canonical
correlation analysis. The essential identity between canonical correlation
analysis and canonical analysis of discriminance will be illustrated
in the next chapter.

The nature of the transformation which the test space undergoes in 
a canonical analysis of discriminance may be grasped by imagining two 
test axes, at right-angles to one another, on the plane surface of an 
elastic rubber sheet. The samples are represented by two-dimensional 
ovals (ellipses) drawn on the surface. All the ovals are the same size and 
all point in the same direction. The positions of the ovals relative to 
one another are given empirically by the data, but the identity of their 
shape and orientation is an a priori assumption which must be satisfied 
if the transformation is to be successful. We now contract the rubber 
sheet until all the ovals become circles. This has the effect of rotating 
the test axes until the cosine of the angle between them is equal to the 
correlation between the two tests within the samples. Taking as our unit 
the distance from the centre of a circle to its circumference, we measure 
the distance between the centres of any two circles. This distance is 
Mahalanobis’ D for the two groups. Of course, in practice the original 
ellipsoids will not be perfectly ellipsoidal and they will not be exactly 
equal. The final spheres will not, therefore, be perfectly spherical and 
they will not all have exactly the same radius in any given direction. 
But the average of the squared radii, with each squared radius weighted 
by the within group degrees of freedom of its sample, will be unity. 
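The contraction of the rubber sheet can be imitated numerically. In the sketch below a Cholesky factor of the within group dispersion matrix is used to turn the ellipses into unit circles; this is only one of many equivalent sphering transformations and is an assumption of the example, not the particular contraction described in the text. The matrix V and the group means are taken from the chapter's example, and the function names are invented here.

```python
import math

V = [[1.0, 1.0], [1.0, 4.0]]   # assumed pooled within group dispersion matrix

def cholesky2(v):
    """Lower-triangular C with C C' = v, for a 2 x 2 positive definite matrix."""
    c11 = math.sqrt(v[0][0])
    c21 = v[1][0] / c11
    c22 = math.sqrt(v[1][1] - c21 ** 2)
    return c11, c21, c22

def sphere(point, chol):
    """Solve C z = point, so that z has unit dispersion within groups."""
    c11, c21, c22 = chol
    z1 = point[0] / c11
    z2 = (point[1] - c21 * z1) / c22
    return [z1, z2]

chol = cholesky2(V)
arts = sphere([4, 6], chol)      # Arts means in the transformed space
science = sphere([2, 2], chol)   # Science means in the transformed space

# Ordinary Euclidean distance between the centres of the two circles.
D = math.hypot(arts[0] - science[0], arts[1] - science[1])
print(round(D, 4))
```

The distance between the transformed centres agrees with the D of 2.3094 used earlier in the chapter.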

If there are only two ellipses, then the situation we have imagined is 
a geometrical picture of the two-group discriminant function analysis. 
The two-group discriminant function which we have already calculated 
is the line joining the centres of the Arts group and the Science group, 
each group being represented as a circle in the transformed space. A 
person’s discriminant score is the foot of a line, perpendicular to the 
discriminant function, which passes through his point in the trans- 
formed space. It has been demonstrated that a person may lie a long 
way from either group mean, and yet project on to the discriminant 
function somewhere within the circle of one of the groups. It may be 
suspected that such a person belongs to a third group which has not 
been included in the analysis. If we include a sample of persons belong- 
ing to this third group, we may represent it by a third circle; and we 
may calculate the distances D between the centre of this circle, which 
represents the mean of the third group, and the centres of the other two 
circles. Unless the third mean happens to lie on the line joining the 
other two means (extended in either direction, the third group need not 
lie between the other two), the three means will generate a two-dimen- 
sional space. We may define the first canonical variate of this space as 
the line of closest fit to the three means. The second canonical variate 
is then a line at right-angles to the first.

If only two tests are employed, the introduction of four or more
groups does not increase the dimensionality of the space because all
the means lie in the two-dimensional plane of the tests. If, however, the
number of tests t is greater than the number of groups g, there will, in
general, be g − 1 canonical vectors. If any group lies entirely within the
space defined by the other group means, one of the possible canonical 
variates will disappear. If there are only two groups, then there is only 
one canonical variate, and that variate is identical with the line joining 
the two means and hence with the two-group discriminant function. 

The first canonical variate is the line which is such that the sum of 
squared deviations of the group means from the line is a minimum. It 
therefore resembles a first principal component, which is a line such 
that the sum of squared deviations of persons from it is a minimum. 
The first canonical vector, which defines the relations between the first
canonical variate and the tests, is the first vector $x_1$ of the equation

\[ (B - \phi_i W) x_i = 0 \]

where B and W are the between and within sums of squares and sums
of products matrices. The equation may also be written

\[ (W^{-1}B - \phi_i I) x_i = 0 \]

The quantities $\phi_i$ are latent roots of $W^{-1}B$ and are defined by the
determinantal equation

\[ |W^{-1}B - \phi I| = 0 \]

An obvious difference between the first of these equations and the equation
for the extraction of latent vectors in a principal component analysis
is that W has replaced the unit matrix I. In geometrical terms this means
that we are squashing the ellipsoidal within group dispersions until they
become spherical. The purpose of the transformation is to ensure that
variance shared by two or more tests does not count more than once in
the discrimination of groups.
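
In the present example the matrices are

\[ B = \begin{bmatrix} 6 & 12 \\ 12 & 24 \end{bmatrix}, \qquad
W = \begin{bmatrix} 4 & 4 \\ 4 & 16 \end{bmatrix}, \qquad
W^{-1}B = \begin{bmatrix} 1.0 & 2.0 \\ 0.5 & 1.0 \end{bmatrix} \]

As in canonical correlation analysis, we have to calculate the latent roots
and vectors of a non-symmetric matrix.

\[ x_1 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \qquad \phi_1 = 2 \]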

The second root is zero because the first vector passes through the means 
of the two groups and so accounts for all their variance. The number of 
non-zero roots cannot be greater than the number of tests, or one less 
than the number of groups, whichever is the smaller. 
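For a two-test problem the latent roots and the first latent vector of W⁻¹B can be obtained directly from the trace and determinant. The following Python sketch does so for the matrices of the example; the helper names are invented here, and the closed-form solution applies only to 2 × 2 matrices with real roots.

```python
import math

B = [[6.0, 12.0], [12.0, 24.0]]   # between groups sums of squares and products
W = [[4.0, 4.0], [4.0, 16.0]]     # within groups sums of squares and products

def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def matmul2(a, b):
    """Product of two 2 x 2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def latent_roots2(m):
    """Both roots of the characteristic equation, largest first (real roots assumed)."""
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return (tr + disc) / 2.0, (tr - disc) / 2.0

def latent_vector2(m, root):
    """A vector x with (m - root I) x = 0, scaled so that its second element is 1.
    Assumes the top-left element of (m - root I) is not zero."""
    return [-m[0][1] / (m[0][0] - root), 1.0]

WB = matmul2(inv2(W), B)
phi1, phi2 = latent_roots2(WB)
x1 = latent_vector2(WB, phi1)

print("W^-1 B =", WB)
print("roots  =", round(phi1, 6), round(phi2, 6))
print("x1     =", [round(e, 6) for e in x1])
```

Only the ratio of the elements of `x1` is determined, so any multiple of the vector returned here serves equally well.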

For the sake of simplicity, the weighting vector $x_1$ is reported in such
a manner that its elements are simple whole numbers. This is possible
because, it must be remembered, it is only the ratio of one element to
another, not the absolute value of elements, which is determined by the
equation for the extraction of latent vectors. The weighting vectors $x_i$
are applied to the test scores of individual persons in order to obtain
their scores on the canonical variates. The mean canonical variate score
of a group may be most directly calculated by applying the canonical
vector to the means of the group on the tests.

The within group variance of a canonical variate is given by

\[ x_i' V x_i \]

which, in the case of $x_1$, is

\[ x_1' V x_1 = 12 \]

As in the case of a two-group discriminant function, it is convenient to
set the within group variance to unity

\[ x_i' V x_i = 1 \]

This is achieved, in our example, by dividing each element of $x_1$ by
$\sqrt{12}$. It is then found that the standardized canonical vector $x_1$ is
identical with the standardized discriminant function,

\[ x_1 = \begin{bmatrix} 2/\sqrt{12} \\ 1/\sqrt{12} \end{bmatrix}
       = \begin{bmatrix} 4/(3D) \\ 2/(3D) \end{bmatrix} \]

The scores of the six persons on the standardized canonical vector are
therefore exactly the same as their scores on the discriminant function, 
and the difference between the means of the two groups is D. 
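The identity of the standardized canonical vector and the standardized discriminant function is easily checked. In this sketch the two-group weights w = (4/3, 2/3), the dispersion matrix V and the Science means (2, 2) are assumptions carried over from the chapter's example; the function name `quad` is invented here.

```python
import math

V = [[1.0, 1.0], [1.0, 4.0]]   # assumed pooled dispersion matrix, W / 4
x1 = [2.0, 1.0]                # canonical vector reported in whole numbers
w = [4 / 3, 2 / 3]             # assumed two-group weights, V^-1 d with d = (2, 4)
D = math.sqrt(16 / 3)          # distance between the groups

def quad(x, m):
    """Quadratic form x' m x."""
    size = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(size) for j in range(size))

variance = quad(x1, V)                            # within group variance of the raw variate
x1_std = [e / math.sqrt(variance) for e in x1]    # canonical vector with unit variance
w_std = [e / D for e in w]                        # standardized discriminant function

arts_score = sum(a * b for a, b in zip(x1_std, [4, 6]))
science_score = sum(a * b for a, b in zip(x1_std, [2, 2]))

print(variance, [round(e, 4) for e in x1_std], [round(e, 4) for e in w_std])
print(round(arts_score, 4), round(science_score, 4), round(arts_score - science_score, 4))
```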

If there are more than two groups in the analysis, the canonical 
vectors will not, in general, pass through any of the group means, as 
they do in the two-group case. The distance between two groups on a 
complete set of canonical variates, or on any subset, is calculated by 
squaring the difference between the group means on each variate and 
taking the square root of the sum of squares. 

It has been emphasized that discriminant analysis assumes the homogeneity
of within group dispersions. It is not always practicable, or even
desirable, to test the hypothesis of homogeneity by the method described
in chapter 2. A useful check on the degree of heterogeneity displayed by
the samples involves the calculation of the within group variance of each
group on each standardized canonical variate. The weighted average of
these variances is necessarily unity. If any of the variances deviates
greatly from unity, it may be inferred that the original dispersions cannot
be homogeneous. The formula for the variance of an individual
group on a standardized canonical variate $x_i$ is

\[ x_i' V_j x_i \]

where $V_j$ is the sums of squares and sums of products matrix of the
group divided by its degrees of freedom.

It has been suggested that the application of the chi square test of 
homogeneity of dispersion is not always desirable. It is reasonable to 
suppose that discriminant analysis and analysis of dispersion are suffi- 
ciently robust to withstand small discrepancies between dispersion 
matrices.

The test of significance of the difference between means for the two-group
case involves the calculation of

\[ L = \frac{|W|}{|T|} \]

Exactly the same criterion applies to the case of more than two groups.
Using the improved chi square approximation given at the end of
chapter 2, we obtain

\[ \chi^2 = -\left[ \sum n_j - 1 - \tfrac{1}{2}(t + g) \right] \log_e L, \qquad \text{d.f.} = t(g - 1) \]


This is a test of any and every difference among the group means. It 
tells us that there are significant differences; it does not tell us where 
they are. In any analysis of real data canonical analysis will extract
several dimensions of the distribution of group means, and it will extract
them in order of importance. We naturally, therefore, wish to test
the significance of the dispersion of group means along each dimension
in turn. This may be accomplished because the latent roots $\phi_i$ have the
property

\[ L = \prod_{i=1}^{q} \frac{1}{1 + \phi_i} \]

that is, L is equal to the product

\[ \left( \frac{1}{1 + \phi_1} \right) \left( \frac{1}{1 + \phi_2} \right) \cdots \left( \frac{1}{1 + \phi_q} \right) \]

where there are q roots. A chi square may be calculated for each root
by the formula

\[ \chi^2_i = \left[ \sum n_j - 1 - \tfrac{1}{2}(t + g) \right] \log_e (1 + \phi_i), \qquad \text{d.f.} = t + g - 2i \]


where i is 1 for the first root, 2 for the second root, and so on. The chi
squares for the roots sum to chi square for L. The associated degrees
of freedom are likewise additive.
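The additivity of the chi squares and of their degrees of freedom can be demonstrated in a few lines. The sketch below applies the formulae to the single root of the two-group example and then to a purely hypothetical set of two roots (the values 1.5 and 0.4, thirty persons, three tests and three groups are invented for illustration and come from no data in this book). The function names are likewise invented here.

```python
import math

def root_chi_squares(roots, n_total, t, g):
    """Chi square and degrees of freedom for each latent root, largest root first."""
    multiplier = n_total - 1 - 0.5 * (t + g)
    return [(multiplier * math.log(1 + phi), t + g - 2 * i)
            for i, phi in enumerate(roots, start=1)]

def overall_chi_square(roots, n_total, t, g):
    """Chi square for L, where L is the product of 1 / (1 + phi_i)."""
    L = 1.0
    for phi in roots:
        L /= 1 + phi
    multiplier = n_total - 1 - 0.5 * (t + g)
    return -multiplier * math.log(L), t * (g - 1)

# Two-group example of the chapter: one non-zero root, phi = 2.
example_roots = root_chi_squares([2.0], n_total=6, t=2, g=2)
example_total = overall_chi_square([2.0], n_total=6, t=2, g=2)

# Hypothetical three-group, three-test case with two roots.
hypo_roots = root_chi_squares([1.5, 0.4], n_total=30, t=3, g=3)
hypo_total = overall_chi_square([1.5, 0.4], n_total=30, t=3, g=3)

print(example_roots, example_total)
print(hypo_roots, hypo_total)
```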


The interpretation of these tests of significance is quite straight- 
forward. A significant overall chi square indicates that it is not the case 
that all the groups are drawn from a common population with respect 
to means. If the chi square for the first canonical variate is significant, 
and the sum of the chi squares for the remaining roots is not significant, 
then the distribution of group means does not deviate significantly from 
a straight line. If the first two roots only are significant, then the group 
means may be assumed to lie in a plane. If three roots are significant, 
then a three-dimensional space is required to accommodate the dis- 
persion of the groups. 

The non-significance of a root does not, of course, prove that the 
associated canonical variate is non-existent or negligible. What it im- 
plies is that, when the persons and the group means are projected on to 
the canonical variate, the differences among group means are small 
relative to the differences among persons within a group. With larger 
samples, the groups may be significantly separated along this dimension. 
If the existence of a dimension is of intrinsic interest because of its 
relevance to a theory, the research worker is quite justified in taking 
encouragement from the occurrence of the dimension, even though its 
associated root is non-significant. 

The chi square test of a canonical variate may be non-significant
because most of the sample means on the variate are estimates of a
common population mean, while a single mean stands rather apart
from the rest. The sample means may be examined for this possibility.
In the discussion of principal component analysis, it was pointed out 
that an examination should be made of the proportion of the squared 
variance of each person which is accounted for by each component. A 
similar examination should be made of sample means on canonical 
variates, with each mean expressed as a deviation from the overall mean 
of its variate. The squared distance of a group from the origin is the 
sum of squares of its means on the variates. The square of each mean 
may be expressed as a percentage of the squared distance from the 
origin. A particular variate may be of little importance to most of the 
groups, but of considerable importance to one group. 

It has been shown that, in the two-group case, a squared multiple
correlation coefficient may be calculated as

\[ R^2 = \frac{\phi}{1 + \phi} \]

Analogously, a canonical root* $\mu_i$ may be calculated for each canonical
variate by the formula

\[ \mu_i = \frac{\phi_i}{1 + \phi_i} \]

The square root of $\mu_i$ is the canonical correlation associated with the
ith dimension of the dispersion of group means.

The tests of significance of the individual roots may be restated in
terms of the $\mu_i$. The restatement is algebraically equivalent to the
original statement, but it helps to illuminate the test and also to illustrate
the formal equivalence of canonical analysis of discriminance and
canonical correlation analysis. The maximum value of a canonical root
$\mu$ is unity, and L is a measure of the extent to which the canonical
roots fall short of unity:

\[ L = \prod (1 - \mu_i) \]

This formula was given in the preceding chapter. The tests of the
canonical roots are exactly the same as the tests employed in canonical
correlation analysis, and they are also identical with the tests of the
roots $\phi_i$. The choice of formulae to be employed in any particular
analysis is simply a matter of computational convenience.


* $\mu_i$ is a latent root of $|B - \mu_i T| = 0$ and $\psi_i$ is a latent root of
$|W - \psi_i T| = 0$. The sets of latent vectors associated with these
determinantal equations are identical with one another and with the set of
canonical vectors associated with $|B - \phi W| = 0$.

The test of L may be improved by employing Rao’s (1952, p. 261)
F approximation. The chi square approximations for the individual 
roots may be improved by introducing Lawley’s (1959) modifications 
of the multipliers. 


Interpretation of a discriminant analysis 


The interpretation of a canonical variate requires a consideration both 
of the scatter of groups along it and of the weights of tests in the 
canonical vector which defines it. The last chapter contained a summary 
discussion of the difficulties involved in assessing the relative importance 
of tests in a regression equation. Exactly the same difficulties occur in 
discriminant function analysis and in canonical analysis of discrimi- 
nance, and for exactly the same reason. A discriminant function allots 
weights to variables solely according to their contribution to an opti- 
mum differentiation of group means. Only in exceptional circumstances 
can their contributions be regarded as independent and additive. Never- 
theless, the research worker insists that some attempt should be made 
to apportion a percentage weight to each variable. There is, perhaps, 
no harm in making the attempt, so long as the resulting coefficients of 
separate determination are not interpreted mechanically. 

In the case of a discriminant function for two groups, the weight of a
test is determined by multiplying each element of w by the corresponding
element of d and expressing the product as a percentage of D². Thus the
coefficient of separate determination of test X is

\[ \frac{\tfrac{4}{3} \times 2}{\tfrac{16}{3}} = 0.50 \]

which means that test X has a weight of 50 per cent in the differentiation
of the groups.
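The coefficients of separate determination for both tests follow from the same arithmetic. The sketch below assumes the weights w = (4/3, 2/3) and the difference vector d = (2, 4) of the chapter's example; the variable names are chosen here.

```python
w = [4 / 3, 2 / 3]   # assumed discriminant weights for tests X and Y
d = [2, 4]           # differences between the group means on X and Y

# D squared is the inner product of the weights and the mean differences.
D2 = sum(wi * di for wi, di in zip(w, d))

# Share of D squared attributed to each test.
shares = [wi * di / D2 for wi, di in zip(w, d)]

for test, share in zip(["X", "Y"], shares):
    print(f"test {test}: {100 * share:.0f} per cent")
```

The shares necessarily sum to unity, which is what gives them the appearance of additive contributions; the caution expressed in the text about interpreting them mechanically still applies.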

The general problem of assessing the relative importance of variables 
in a regression equation is discussed in chapter 10. 

If the interpretation of a canonical vector or discriminant function is 
based on an examination of the weights of the variables in the regression 
equation, the research worker must beware of a dangerous trap. Dis- 
criminant analysis is almost always applied to the raw, unstandardized 
variables. Other things being equal, the variable which has a small 
variance will be given a large regression weight, and the variable which 
has a large variance will be given a small regression weight. The analysis 
may, of course, be applied to standardized scores. One possible 
standardization is achieved by dividing each score on a test by the total 
standard deviation of that test. The square of the total standard devia- 
tion is obtained by dividing the total sum of squares by the sum of the 
number of persons in all the groups. If all the standardized variables
have the same weight in a regression equation, then the weights of the
non-standardized variables will be inversely proportional to the total 
standard deviations. An example at the beginning of the next chapter 
illustrates the conversion of regression weights for non-standardized 
variables to weights for standardized variables. 
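The effect of standardization on the weights may be seen in the present example. In the following sketch the raw weights (4/3, 2/3) are assumed from the two-group analysis, the total sums of squares 10 and 40 are the diagonal elements of T, and the six persons are the two samples of three. A raw weight is converted to a weight for a standardized variable by multiplying it by the total standard deviation of its test.

```python
import math

total_ss = [10.0, 40.0]   # total sums of squares of X and Y (diagonal of T)
n_persons = 6             # persons in all the groups
w_raw = [4 / 3, 2 / 3]    # assumed weights for the raw scores

# Total standard deviation: square root of total sum of squares over persons.
total_sd = [math.sqrt(ss / n_persons) for ss in total_ss]

# If z = x / sd, then w * x = (w * sd) * z, so the weight for z is w * sd.
w_standardized = [wi * sd for wi, sd in zip(w_raw, total_sd)]

print([round(sd, 4) for sd in total_sd])
print([round(wi, 4) for wi in w_standardized])
```

Here the raw weights stand in the ratio 2:1 while the standard deviations stand in the ratio 1:2, so that the two standardized variables receive equal weights.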

The employment of raw rather than standardized scores in discri- 
minant analysis is a matter of computational convenience. The primary 
purpose of the method is to derive a rule for the allocation of objects 
or persons to classes. The rule is most convenient if it can be applied 
directly to raw scores. None of the properties of the standardized 
canonical variates (expressed as deviations from their means) is altered 
by standardization of variables. In particular, the distances between
groups remain unchanged.

In the last chapter an example was given of a multiple regression
analysis in which the scores on tests were replaced by scores on components.
Exactly the same substitution may be made in discriminant
analysis.

A property of canonical vectors which may cause some puzzlement 
is the fact that the inner product of two vectors is not, in general, zero 
(see the example in the next chapter). This follows from the fact that 
the vectors are latent vectors of a non-symmetric matrix. It is conventional
to represent the orthogonality of principal components by the
formula

\[ f_i' f_j = 0, \qquad i \neq j \]

which implies that two components are uncorrelated because the latent
vectors by which they are produced are orthogonal. The equation holds
when $f_i$ and $f_j$ are latent vectors of a symmetric matrix A. It might be
written, more generally,

\[ f_i' A f_j = 0 \]

Canonical variates are not principal components of either W or B,
and so canonical vectors do not have the property $x_i' x_j = 0$; but they
do have the dual property,

\[ x_i' W x_j = 0 \]
\[ x_i' B x_j = 0 \]

that is, the average within group correlation of any two canonical variates
is zero, and the correlation of the group means on any two variates is
zero. In so far as the dispersion within every group is rightly supposed to
be an estimate of a common dispersion and contributes to W, then the
correlation between two variates within any individual group tends to
zero. Because the canonical variates are uncorrelated within and between
groups, they are uncorrelated over all persons, regardless of grouping,

\[ x_i' T x_j = 0 \]


The three formulae of the last paragraph state that the sum of
products of any pair of canonical variates is zero. The following formula
states that the maximized quantity $\phi$ is a ratio of sums of squares:

\[ \phi_i = \frac{x_i' B x_i}{x_i' W x_i} \]
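These properties may be verified on the matrices of the two-group example. In the sketch below the first vector (2, 1) is the one reported in the chapter; the second vector (2, −1), belonging to the zero root, is not printed in the text and has been worked out here from the condition that W⁻¹B applied to it gives zero. The function name `bilinear` is invented for the purpose.

```python
B = [[6, 12], [12, 24]]   # between groups sums of squares and products
W = [[4, 4], [4, 16]]     # within groups sums of squares and products
T = [[B[i][j] + W[i][j] for j in range(2)] for i in range(2)]   # total matrix

x1 = [2, 1]    # canonical vector for the root 2
x2 = [2, -1]   # vector for the zero root (derived here, not quoted from the text)

def bilinear(x, m, y):
    """The product x' m y for 2-element vectors and a 2 x 2 matrix."""
    return sum(x[i] * m[i][j] * y[j] for i in range(2) for j in range(2))

inner = sum(a * b for a, b in zip(x1, x2))   # ordinary inner product
cross_w = bilinear(x1, W, x2)
cross_b = bilinear(x1, B, x2)
cross_t = bilinear(x1, T, x2)

phi1 = bilinear(x1, B, x1) / bilinear(x1, W, x1)
phi2 = bilinear(x2, B, x2) / bilinear(x2, W, x2)

print("x1'x2 =", inner)
print("x1'Wx2, x1'Bx2, x1'Tx2 =", cross_w, cross_b, cross_t)
print("roots as ratios of sums of squares:", phi1, phi2)
```

The ordinary inner product of the two vectors is not zero, although all three sums of products vanish.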

The apparent analogy with univariate analysis of variance may be made 
explicit by treating a canonical variate as a single variable, rather than 
as a weighted compound of the original variables. It is customary to 
standardize a canonical variate so that its within group mean square 
(variance) is unity, 


x,'Vx; = 1 


If this has been done, the within group sum of squares is equal to the
within group degrees of freedom, which we may call $w$.* The value of
$w$ is calculated in the usual way, $w = \sum_k (n_k - 1)$. The between
groups degrees of freedom are $b = g - 1$. An ordinary analysis of
variance table may be drawn up from the knowledge that $w\phi$ gives us
the between groups sum of squares,

Source     d.f.      SS               MS
Between    $b$       $w\phi$          $w\phi / b$
Within     $w$       $w$              1
Total      $b + w$   $w(\phi + 1)$


$\phi$ is the ratio of the between to the within sum of squares. The
canonical root $\mu$ is the ratio of the between to the total sum of
squares,

$$\mu = \frac{w\phi}{w\phi + w} = \frac{\phi}{1 + \phi}$$

In ordinary univariate analysis of variance this ratio is known as the 
square of the correlation ratio eta (Fisher, 1958, p. 255). 
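
The analysis of variance table and the relation between the two forms of the canonical root can be checked numerically. The following is a minimal sketch in Python; the function name `anova_from_phi` and the value of 4 chosen for the ratio of between to within sums of squares are illustrative assumptions, not figures from the text.

```python
def anova_from_phi(phi, group_sizes):
    """Analysis of variance table for a standardized canonical variate.

    Assumes the variate has unit within group mean square, so that the
    within group sum of squares equals the within group d.f. (w).
    """
    w = sum(n - 1 for n in group_sizes)   # within groups d.f.
    b = len(group_sizes) - 1              # between groups d.f.
    ss_between = w * phi                  # between groups sum of squares
    ss_within = float(w)
    return {
        "between": (b, ss_between, ss_between / b),   # d.f., SS, MS
        "within": (w, ss_within, 1.0),
        "total": (b + w, ss_between + ss_within),
        # ratio of between to total sum of squares, phi / (1 + phi)
        "mu": ss_between / (ss_between + ss_within),
    }


# Hypothetical example: four groups of three persons, phi = 4.
table = anova_from_phi(phi=4.0, group_sizes=[3, 3, 3, 3])
```

With these assumed inputs the between to total ratio is 4/(1 + 4), whatever the group sizes, because the within group degrees of freedom cancel.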

In some studies it may be appropriate to examine the scatter of
groups along a particular canonical variate by allotting the between
groups degrees of freedom to specially-designed comparisons. This sort
of analysis was suggested in chapter 4, in the discussion of principal
components. Such an approach may aid the interpretation of a canonical
variate. On the other hand, considered as a test of significance, it will
often provide only a rough approximation to the purpose with which
the research worker embarked on the discriminant analysis. A test of
significance applied to a designed comparison on a particular canonical
variate implies that the research worker has set out with a double
hypothesis. First, he has defined a dimension a priori in terms of
variable weights, and he has identified a certain empirically-derived
canonical vector with this dimension. Secondly, he has hypothesized a
particular ordering of groups along this dimension. Few research workers
are as precise and bold as this. The method of designed comparisons
illustrated in the following chapter is closer to the usual state of mind
of the worker who wishes to use a judicious combination of a priori
assumption and empirical inference.

* w has nothing to do with the weighting vector for which the same letter
was used at the beginning of the chapter.

Throughout this chapter we have employed equal-sized samples from 
each of our populations. This is clearly an efficient procedure, whatever 
the relative a priori frequencies of the populations, because it yields 
equally good estimates of the multivariate means of all populations. If, 
however, the number of persons is not constant from sample to sample, 
the research worker is faced with a choice. Having obtained his canonical 
vectors, and having calculated the mean of each group on each canonical 
variate, he will wish to graph the position of groups on pairs of variates. 
If he wants the origin of his co-ordinate system to lie at the zero point 
of each variate, he must locate these zero points. To do this, he can 
either add up the means of all samples on a variate and divide by the 
number of samples, or he can apply the canonical vector to the overall 
means of all persons on the original tests. Both calculations give the 
same result when samples are of equal size. They give different results, 
in general, when samples are of unequal size. The first method gives the 
unweighted mean of the groups on a variate, the second method gives 
their weighted mean. 

The difference between the weighted and unweighted origins affects
the position of the axes relative to the group means. It does not affect
the position of the group means relative to one another. If one of the
purposes of the study is to identify extreme groups, then choice of origin
may be important. The obvious measure of the relative extremeness of
groups is distance from the origin.
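
The two origins can be compared in a few lines. This is a sketch under stated assumptions: the three group means and the sample sizes below are hypothetical numbers invented for illustration, and the weighted origin is computed as the size-weighted mean of the group means, which is what applying the canonical vector to the overall means of all persons amounts to.

```python
import numpy as np

# Hypothetical means of three groups on one canonical variate,
# with unequal (hypothetical) sample sizes.
group_means = np.array([2.0, 0.5, -1.0])
group_sizes = np.array([10, 20, 70])

# First method: add the group means and divide by the number of samples.
unweighted_origin = group_means.mean()

# Second method: weight each group mean by its sample size.
weighted_origin = np.average(group_means, weights=group_sizes)

# Positions of the groups relative to each origin.
from_unweighted = group_means - unweighted_origin
from_weighted = group_means - weighted_origin

# Distances between neighbouring groups are the same under either origin.
gaps_unweighted = np.diff(from_unweighted)
gaps_weighted = np.diff(from_weighted)
```

In this invented case the first group lies 1.5 units from the unweighted origin but 2.4 units from the weighted one, so a ranking of groups by extremeness can depend on the choice.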

Just as we may cluster persons by means of a taxonomic analysis of
distances between pairs of persons, so we may cluster groups by a
taxonomic analysis of distances between pairs of groups. The former
analysis yields classes, and the latter yields classes of classes. The
research worker must decide whether to weight the groups in the
taxonomic analysis. If he does weight them, he will probably use a
priori frequencies rather than sample sizes. The question of weighting
manifests itself when two or more groups are combined into a single
class. In calculating distances from this higher-order class to other
classes, the weighted and the unweighted means of the higher-order
class both offer themselves as plausible estimates of its position.

A polar co-ordinate representation of canonical vectors may be
drawn, using exactly the same methods as those used in the preceding
chapter. The following chapter should make clear the application of the
formulae to discriminant analysis.
Chapter 8 Imposition of Structure


Let us suppose that a research worker has taken two measurements for 
each of a number of persons, and each person is known to belong to 
one or other of four groups. All four samples are the same size, each 
containing three members. The four within group dispersion matrices 
are found to be identical. The scores of the twelve persons on the two 
tests are as follows: 


Group    $t_1$    $t_2$
A           9      10
A          10       8
A          11      12
B           7      10
B           8       8
B           9      12
C           5       6
C           6       4
C           7       8
D           3       2
D           4       0
D           5       4


Canonical analysis of discriminance 


If he has some knowledge of multivariate analysis, the investigator will
almost certainly calculate the vectors which discriminate the groups;
that is, he will perform a canonical analysis of discriminance. (We shall
assume that the purpose of the research does not call for an analysis of
covariance or some other approach.) Because the arithmetic is
complicated, and because the methods have all been explained in detail in
previous chapters, we shall not give details of the calculations but simply
concentrate on the results.

With only two tests the research worker can obtain only two canonical
roots and vectors. The canonical roots may be reported in any of the
three forms $\phi$, $\mu$, or w (chapter 7). For some purposes $\mu$ is
the most informative quantity, because it is the square of a multiple
correlation coefficient. The canonical vectors may be reported in
standardized form, that is, so that the average within group mean square
of the canonical variates is unity. The canonical roots and vectors are
then


          $x_1$      $x_2$
$t_1$     0.8302    -0.8025
$t_2$     0.1400     0.5601
$\mu$     0.8881     0.3619


The values of $\mu$ tell us how good the separation of the groups is along
the line of the two canonical variates, the higher the value the better the
separation. The elements of the $x_i$ are the weights which must be applied
to the raw (unstandardized) scores of a person in calculating his position
on a canonical variate.
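
Since the calculations are not set out in the text, the following sketch shows one way of reproducing the roots and vectors from the twelve pairs of scores. It is an illustration rather than the author's procedure: the use of NumPy, the solution of the latent root problem through a Cholesky factor of the within groups matrix, and the sign convention (last weight of each vector positive) are all assumptions.

```python
import numpy as np

# Scores on t1 and t2, three persons in each of groups A, B, C, D.
scores = np.array([
    [9, 10], [10, 8], [11, 12],
    [7, 10], [8, 8], [9, 12],
    [5, 6], [6, 4], [7, 8],
    [3, 2], [4, 0], [5, 4],
], dtype=float)
groups = np.repeat(np.arange(4), 3)

# Within (W) and between (B) groups sums of squares and products.
grand_mean = scores.mean(axis=0)
W = np.zeros((2, 2))
B = np.zeros((2, 2))
for g in np.unique(groups):
    x = scores[groups == g]
    dev = x - x.mean(axis=0)
    W += dev.T @ dev
    m = x.mean(axis=0) - grand_mean
    B += len(x) * np.outer(m, m)

w = len(scores) - len(np.unique(groups))    # within groups d.f. (8 here)

# Solve B x = phi W x via the Cholesky factor of W (W = L L').
L_inv = np.linalg.inv(np.linalg.cholesky(W))
phi, U = np.linalg.eigh(L_inv @ B @ L_inv.T)
order = np.argsort(phi)[::-1]               # largest root first
phi = phi[order]
X = L_inv.T @ U[:, order] * np.sqrt(w)      # scaled so that x' V x = 1, V = W / w
X = X * np.sign(X[-1, :])                   # assumed sign convention
mu = phi / (1.0 + phi)                      # between / total form of the roots

# The dual orthogonality property: off-diagonal elements vanish.
XWX = X.T @ W @ X
XBX = X.T @ B @ X
```

Under these assumptions `mu` and the columns of `X` agree with the tabled roots and vectors to the accuracy of the printed figures, and the off-diagonal elements of `XWX` and `XBX` are zero apart from rounding error.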

At this point the analysis often stops. The research worker, exhausted
by the calculation and inversion of matrices, and apprehensive about
the accuracy of his arithmetic, feels that so much labour must have
produced an interpretable result, and so he draws up his table of
canonical vectors and summarizes it in words as best he can. If he is a
little more cautious he takes two further steps, which require only a few
extra calculations.


The first step is to calculate the mean of each group on each
standardized canonical variate. The means, expressed as deviations from
the overall mean, are


Group    $v_1$    $v_2$
A         2.91    -0.73
B         1.25     0.88
C        -0.97     0.24
D        -3.19    -0.39
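
These means follow directly from applying the tabled canonical vectors to the deviations of the group means from the overall mean. A short sketch, assuming NumPy and using the rounded vector elements as printed:

```python
import numpy as np

scores = np.array([
    [9, 10], [10, 8], [11, 12],     # group A
    [7, 10], [8, 8], [9, 12],       # group B
    [5, 6], [6, 4], [7, 8],         # group C
    [3, 2], [4, 0], [5, 4],         # group D
], dtype=float)
groups = np.repeat(np.arange(4), 3)

# Canonical vectors x1 and x2 as columns (printed values, four decimals).
x = np.array([[0.8302, -0.8025],
              [0.1400,  0.5601]])

group_means = np.array([scores[groups == g].mean(axis=0) for g in range(4)])
overall_mean = scores.mean(axis=0)

# Rows are groups A to D, columns are the variates v1 and v2.
variate_means = (group_means - overall_mean) @ x
```

Because the samples are of equal size, the overall mean used here is both the weighted and the unweighted origin.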


The neat pattern of the groups on the first variate, with each group
more or less equidistant from the next, suggests that a meaningful
structure has been revealed by the analysis. Only an examination of the
actual material could tell us whether this is so.

The second step involves, in effect, a standardization of the tests.
Discriminant function analysis is usually carried out on raw scores. The
purpose of the technique is to allot persons to groups, and the
calculations are greatly simplified if the discriminant functions can be
applied direct to the raw scores. But, from the point of view of
interpretation, non-standardized variables are a nuisance, because the
weight of a variable varies inversely with its total standard deviation,
that is, with the square root of the total sum of squares divided by the
number of persons.* The canonical vectors which apply to the standardized
scores may be achieved by multiplying each element of $x_i$ by the total
standard deviation of the appropriate test $t_i$. We shall perform this
calculation for our example and then we shall normalize the resulting
canonical vectors. Normalization is desirable if two weighting vectors are
to be compared with one another. Let us call the total standard deviation
$s$ and the resulting normalized vectors $a_i$.


First canonical vector

          $x_1$      $s$       $s x_1$    $a_1$
$t_1$     0.8302    2.3805     1.9763     0.9674
$t_2$     0.1400    3.6968     0.5176     0.2532

Second canonical vector

          $x_2$      $s$       $s x_2$    $a_2$
$t_1$    -0.8025    2.3805    -1.9104    -0.6781
$t_2$     0.5601    3.6968     2.0706     0.7350
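
The two steps of the calculation, multiplication by the total standard deviation and normalization to unit length, may be sketched as follows. NumPy is assumed, and the canonical vectors are entered as printed rather than recomputed.

```python
import numpy as np

scores = np.array([
    [9, 10], [10, 8], [11, 12],
    [7, 10], [8, 8], [9, 12],
    [5, 6], [6, 4], [7, 8],
    [3, 2], [4, 0], [5, 4],
], dtype=float)

# Canonical vectors for raw scores, x1 and x2 as columns.
x = np.array([[0.8302, -0.8025],
              [0.1400,  0.5601]])

# Total standard deviation: root of the total sum of squares divided by
# the number of persons (not by the degrees of freedom).
deviations = scores - scores.mean(axis=0)
s = np.sqrt((deviations ** 2).sum(axis=0) / len(scores))

# Weights for standardized scores, then normalization of each column
# so that its squared elements sum to one.
sx = s[:, None] * x
a = sx / np.linalg.norm(sx, axis=0)
```

The columns of `a` can then be compared with one another, since each has unit length.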


Canonical correlation analysis 


It was said above that the means of the groups showed a neat pattern
on the first canonical variate. Let us, however, suppose that the pattern
is a distorted reflection of the structure which the investigator would
expect to find. The experimenter suspects that the first canonical variate
is an amalgam of dimensions which he wishes to distinguish from one
another. The problem is very similar to the problem facing the factor
analyst who wishes to rotate his principal axes, or break them down
into group factors. The research worker who collected the data of our
example is more fortunate than the factor analyst, in that he has
provided himself with an alternative framework which will serve as a
criterion for rotation of variates. The grouping of the subjects is
independent of the $t_i$ variables, and the research worker may be able to
specify particular combinations of groups which, he feels, will provide
him with theoretically interesting dimensions of the discriminant space.

The proposed analysis is a generalization of the method of designed
comparisons in the analysis of variance to the multivariate case. Before
illustrating the multivariate application of designed comparisons, we
shall carry out another analysis which shows the essential equivalence
of the canonical analysis of discriminance and the method of canonical
correlation analysis.

* One advantage of the coefficient of separate determination is that it is
unaffected by the omission of a preliminary standardization.

The canonical analysis of discriminance of standardized scores is
formally identical with a canonical correlation analysis in which one set
of variables consists of the tests which are employed in the discriminant
function analysis, and the other set is a set of dummy variables. There
are several ways of constructing dummy variables. For the purpose of
comparing the discriminant analysis with a canonical correlation
analysis, we require dummy variables each of which contrasts membership
of a particular group with membership of any of the other groups. Such
a variable takes only two values. If a person is a member of group X,
then his 'score' on the variable is, let us say, 1; if a person is not a
member of X, then his score is, let us say, 0. The data of our example
are set out below with a set of dummies $d_i$.


Group    $t_1$    $t_2$    $d_1$    $d_2$    $d_3$    $d_4$
A           9      10        1        0        0        0
A          10       8        1        0        0        0
A          11      12        1        0        0        0
B           7      10        0        1        0        0
B           8       8        0        1        0        0
B           9      12        0        1        0        0
C           5       6        0        0        1        0
C           6       4        0        0        1        0
C           7       8        0        0        1        0
D           3       2        0        0        0        1
D           4       0        0        0        0        1
D           5       4        0        0        0        1


It will be recalled that, with four groups, only three between groups
degrees of freedom are available. If the four dummy variables are
correlated with one another, and the latent roots of the correlation
matrix are calculated, it will be found that one root is zero, from which
it follows that any one of the dummies is a weighted combination of the
other three.* We may therefore omit one of the four dummies (it does
not matter which) without altering the weights which the canonical
correlation analysis gives to our original variables $t_i$. If we omit the
fourth dummy and calculate canonical roots and vectors, normalizing
the vectors separately for the $t_i$ and the $d_i$, we obtain

          $a_1$      $a_2$
$t_1$     0.9674    -0.6781
$t_2$     0.2532     0.7350

          $b_1$      $b_2$
$d_1$     0.7756    -0.2288
$d_2$     0.5645     0.8707
$d_3$     0.2823     0.4353
$\mu$     0.8881     0.3619

* This point may be more accurately expressed by defining a dummy $d_0$ in
which every element is unity. $d_0$ takes up a fourth degree of freedom and
represents the fitting of a general mean. Then the fourth dummy, for example,
may be calculated as $d_4 = d_0 - (d_1 + d_2 + d_3)$.


It can be seen at once that the squares of the canonical correlations,
$\mu_i$, and the canonical vectors for the $t_i$ variables are exactly the
same as the canonical roots and vectors of the discriminant analysis of
standardized scores. The new information provided by the canonical
correlation analysis is the weights of the dummies. These weights are of
no intrinsic interest because they vary according to which three of the
four dummies were employed. However, quite apart from this instability,
which results from the fact that the number of degrees of freedom is less
than the number of groups, a problem arises because the first canonical
vector is chosen according to a purely arithmetical criterion, namely the
criterion of minimizing the squared deviations of the group means about
the regression line. It is because the criterion is based on arithmetical
rather than theoretical considerations that difficulties of interpretation
are only too common in canonical analysis. The method of designed
comparisons is proposed as a means of overcoming them.
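
The identity between the two analyses can be verified on the example. The sketch below is one possible computation, not a transcription of the author's: it assumes NumPy, builds the first three 0/1 dummies, and solves the determinantal equation of canonical correlation analysis on the correlation matrices. The sign convention (last weight of each test vector positive) is an assumption.

```python
import numpy as np

scores = np.array([
    [9, 10], [10, 8], [11, 12],
    [7, 10], [8, 8], [9, 12],
    [5, 6], [6, 4], [7, 8],
    [3, 2], [4, 0], [5, 4],
], dtype=float)
groups = np.repeat(np.arange(4), 3)

# Dummies d1, d2, d3: membership of A, B, C; the fourth dummy is omitted.
dummies = (groups[:, None] == np.arange(3)).astype(float)

R = np.corrcoef(np.column_stack([scores, dummies]), rowvar=False)
R_tt, R_td, R_dd = R[:2, :2], R[:2, 2:], R[2:, 2:]

# Part of R_tt accounted for by regression on the dummies.
M = R_td @ np.linalg.solve(R_dd, R_td.T)

# Solve M a = mu R_tt a via the Cholesky factor of R_tt.
L_inv = np.linalg.inv(np.linalg.cholesky(R_tt))
mu, U = np.linalg.eigh(L_inv @ M @ L_inv.T)
order = np.argsort(mu)[::-1]
mu = mu[order]

a = L_inv.T @ U[:, order]
a = a * np.sign(a[-1, :])                 # assumed sign convention
a = a / np.linalg.norm(a, axis=0)         # normalized test weights

# Weights of the dummies: regression of each canonical variate on the d's.
b = np.linalg.solve(R_dd, R_td.T @ a)
b = b / np.linalg.norm(b, axis=0)         # normalized dummy weights
```

Replacing the three dummies by any other three of the four changes `b` but leaves `mu` and `a` as they are, which is the instability referred to above.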


Designed comparisons 


The dummy variables for the designed comparisons must reflect the
hypotheses or assumptions entertained by the investigator. If he believes
that the groups may be equally spaced along some dimension of the space,
but may also show non-linear deviations about this line, he might employ
the following pattern of dummies, A. Each person in a group has the
same score on each dummy.


Pattern A

Group    $d_1$    $d_2$    $d_3$
A         -3        1       -1
B         -1       -1        3
C          1       -1       -3
D          3        1        1
An examination of the means of the groups on the first variate of the
canonical analysis of discriminance tells us that $d_1$ must be a good
approximation to the already obtained first canonical vector. The
experimenter, however, is not interested merely in the empirical relations
among his groups; he wishes to know if they are consistent with a
particular set of laws. A given empirical relation may be compatible
with more than one set of assumptions. A linear ordering of group
means may be a consequence of a factor which is present in different
degrees in the four groups, or it may arise from an additive combination
of a pair of factors represented by the main terms of a factorial design
in the analysis of variance. Let us, for purposes of illustration, adopt an
alternative approach and define the dummies appropriate to a factorial
design as follows:


Pattern B

Group    $d_1$    $d_2$    $d_3$
A         -1       -1       -1
B         -1        1        1
C          1       -1        1
D          1        1       -1


We now perform a canonical correlation analysis in which the $t_i$ and
the $d_i$ of pattern B are the two sets of variables. The canonical roots
and canonical vectors are


          $a_1$      $a_2$
$t_1$     0.9674    -0.6781
$t_2$     0.2532     0.7350

          $b_1$      $b_2$
$d_1$    -0.9046    -0.1225
$d_2$    -0.4219     0.3941
$d_3$     0.0609     0.9108
$\mu$     0.8881     0.3619


The regression weights of the $t_i$ variables in the two canonical vectors
remain unchanged. This is inevitable so long as we continue to work 
exclusively and exhaustively within the space defined by the group 
means. A regression line is unchanged under rotation of axes within 
the complementary space, so long as that space is not mutilated (see 
chapter 6). The line is also unchanged under rotation of axes within its 
own space. If pattern A is substituted for pattern B, no difference is 
made to the scores of persons on the two canonical variates. 
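
The invariance can be checked by running the same calculation with each pattern in turn. The following is a sketch assuming NumPy; the helper name `squared_canonical_correlations` is introduced here for illustration only.

```python
import numpy as np

scores = np.array([
    [9, 10], [10, 8], [11, 12],
    [7, 10], [8, 8], [9, 12],
    [5, 6], [6, 4], [7, 8],
    [3, 2], [4, 0], [5, 4],
], dtype=float)
groups = np.repeat(np.arange(4), 3)

# Rows are groups A to D, columns are d1, d2, d3.
pattern_a = np.array([[-3, 1, -1], [-1, -1, 3], [1, -1, -3], [3, 1, 1]], dtype=float)
pattern_b = np.array([[-1, -1, -1], [-1, 1, 1], [1, -1, 1], [1, 1, -1]], dtype=float)


def squared_canonical_correlations(t, d):
    """Roots of |R_td R_dd^-1 R_dt - mu R_tt| = 0, largest first."""
    p = t.shape[1]
    R = np.corrcoef(np.column_stack([t, d]), rowvar=False)
    R_tt, R_td, R_dd = R[:p, :p], R[:p, p:], R[p:, p:]
    M = np.linalg.solve(R_tt, R_td @ np.linalg.solve(R_dd, R_td.T))
    return np.sort(np.linalg.eigvals(M).real)[::-1]


# Every person in a group receives his group's row of the pattern.
mu_a = squared_canonical_correlations(scores, pattern_a[groups])
mu_b = squared_canonical_correlations(scores, pattern_b[groups])
```

Both patterns span the whole of the between groups space, so the roots coincide; only the weights attached to the individual dummies differ.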
The first canonical vector gives roughly twice as much weight to $d_1$
as to $d_2$, and virtually no weight to $d_3$. We can see why this
weighting emerges by multiplying each element of $d_1$ by 2, and adding
the corresponding element of $d_2$.


Group    $2d_1$    $d_2$    Sum
A          -2       -1      -3
B          -2        1      -1
C           2       -1       1
D           2        1       3


The result is identical with the $d_1$ column of pattern A. There is
nothing surprising in finding that the first canonical vector is closely
approximated by a linear ordering of group weights, because we know
already that the group means are ordered linearly along the dimension of
the first canonical variate. Changing the reference axes does not make the
slightest difference to the mean scores of the groups on either of the
variates. Experimenters should be chary of making inductive
generalizations on the basis of maximization criteria alone. It is often
advisable to test the fit between the data and several sets of structural
assumptions.

The second canonical variate is primarily related to the third designed
comparison and secondarily to the second comparison. The relative
importance of the three a priori dimensions of the hypothesized structure
(pattern B) may be gauged by calculating the correlation matrix of the
dummy variables before and after the elimination of the canonical
variates.

                  Before                            After
          $d_1$     $d_2$     $d_3$         $d_1$      $d_2$      $d_3$
$d_1$    1.0000    0         0             0.2679    -0.3214     0.0893
$d_2$    0         1.0000    0            -0.3214     0.7857    -0.1071
$d_3$    0         0         1.0000        0.0893    -0.1071     0.6964


Most of the variance of the first dummy variable has been accounted 
for by the analysis. This means that the factor which manifests itself in 
the contrast between groups A and B on the one hand, and groups C 
and D on the other, is the most important determinant of the scatter of 
the groups on the $t_i$ variables. The other factors are less important.


The structure of regression 


It should be evident that the introduction of a priori dimensions, defined
in terms of contrasting group-membership, considerably illuminates the
nature of the transformed space in which the group means have their
maximum scatter.* The analysis may, however, be carried further with
a promise of even greater insight. The extension is possible because the
designed comparisons are orthogonal. This orthogonality is apparent
in the zero correlations between the dummy variables before the analysis
begins. The conditions of zero correlation are two: if the numbers of
persons in the groups are equal, then (a) the sum of the elements of each
dummy vector must be zero, and (b) the product of the transpose of
any dummy vector and any other dummy vector must be zero,

$$d_i' d_j = 0 \quad \text{for all } i \text{ and } j,\ i \neq j$$

These two conditions apply to the elements of the vectors expressed as
deviations from their means. It would be quite possible to replace $-1$
by 1 and 1 by 2 in any or all of the dummies without violating the
conditions.†

If the numbers of persons in the groups are not equal, the conditions
become more complicated and often more difficult to satisfy in practice.
For the case of unequal numbers, we define $d_{ki}$ as the $k$th element of
vector $d_i$, where $k$ takes values from 1 up to the number of groups. We
define $n_k$ as the number of persons in the $k$th group. Then the dummies
$d_i$ must satisfy the conditions,

(a) $\sum_k n_k d_{ki} = 0$ for all $i$
(b) $\sum_k n_k d_{ki} d_{kj} = 0$ for all $i$ and $j$, $i \neq j$
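
A proposed set of dummies can be tested against the two conditions before any analysis is run. The sketch below assumes NumPy and that both sums are weighted by the group sizes; the function name `is_orthogonal_set` and the unequal sizes in the second call are hypothetical.

```python
import numpy as np


def is_orthogonal_set(dummies, group_sizes, tol=1e-9):
    """Check conditions (a) and (b) for dummies given one row per group.

    dummies[k, i] is the score of every person in group k on dummy i;
    group_sizes[k] is the number of persons in group k.
    """
    d = np.asarray(dummies, dtype=float)
    n = np.asarray(group_sizes, dtype=float)
    weighted_sums = n @ d                          # condition (a), one sum per dummy
    products = d.T @ (n[:, None] * d)              # condition (b) off the diagonal
    off_diagonal = products - np.diag(np.diag(products))
    return bool(np.all(np.abs(weighted_sums) < tol)
                and np.all(np.abs(off_diagonal) < tol))


pattern_a = [[-3, 1, -1], [-1, -1, 3], [1, -1, -3], [3, 1, 1]]
pattern_b = [[-1, -1, -1], [-1, 1, 1], [1, -1, 1], [1, 1, -1]]

equal_a = is_orthogonal_set(pattern_a, [3, 3, 3, 3])
equal_b = is_orthogonal_set(pattern_b, [3, 3, 3, 3])
unequal_b = is_orthogonal_set(pattern_b, [3, 3, 3, 6])   # hypothetical sizes
```

With equal numbers both patterns pass; doubling the size of the last group is enough to make pattern B fail, which illustrates why the unequal case is harder to satisfy.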


* There is no reason why a priori 'marker' dimensions should not be employed
in studies which at present terminate in a factor analysis. Various approaches
to the employment of canonical correlation analysis are possible. If the marker
dimensions are truly a priori, it should be possible to ensure that they form an
approximately orthogonal set. If they are measured variables, then it may be
desirable to orthogonalize them by replacing a person's scores on the
correlated variables by his scores on the uncorrelated principal components. (A
similar orthogonalization may be carried out on either or both sets of variables
in any analysis, v. chapters 6 and 10.) If it can be assumed that the variance
specific to a set of variables is irrelevant to the analysis, then the variance
common to the two sets may be subjected to a principal component analysis
at the conclusion of the canonical correlation analysis. A polar co-ordinate
analysis is feasible so long as each variable in the analysis retains a roughly
constant proportion of its variance. The process may be adapted to yield a
solution to the communality problem.

† This possibility is mentioned here because the computer programme
'Regression' in the Handbook of Multivariate Methods Programmed in Atlas
Autocode (in the hardback edition of this book) treats -1 as signifying 'not
known'. Minus quantities must therefore be avoided in the construction of
dummy variables. If it is desired to make all elements of a dummy positive,
a suitable constant should be added to every element.
Even when the arithmetical values of truly orthogonal dimensions are
difficult to come by, it is usually possible to introduce approximately
orthogonal $d_i$ and then to replace them by their principal components.
It may occasionally be useful to adopt the method of designed
comparisons to deal with cases where the group-membership of certain
persons is doubtful.

The conditions of orthogonality have been set out in detail because
orthogonality is a valuable property of a set of a priori dimensions,
since it enables the experimenter to make yet one further analysis of his
data. Because our example employed only two tests, we could,
necessarily, obtain no more than a two-dimensional canonical space. But we
have four groups and therefore three between groups dimensions. It is
possible to take each of these dimensions separately and to relate it to
the variables $t_i$ by means of a multiple regression analysis. The values
of $R^2$ for the three analyses are

Dummy variable    $R^2$
$d_1$             0.7321
$d_2$             0.2143
$d_3$             0.3036
Sum               1.2500
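
The three values of the squared multiple correlation can be obtained together as the diagonal of the part of the dummy correlation matrix that is attributable to regression on the tests. A sketch, assuming NumPy and the dummies of pattern B:

```python
import numpy as np

scores = np.array([
    [9, 10], [10, 8], [11, 12],
    [7, 10], [8, 8], [9, 12],
    [5, 6], [6, 4], [7, 8],
    [3, 2], [4, 0], [5, 4],
], dtype=float)
groups = np.repeat(np.arange(4), 3)
pattern_b = np.array([[-1, -1, -1], [-1, 1, 1], [1, -1, 1], [1, 1, -1]], dtype=float)

R = np.corrcoef(np.column_stack([scores, pattern_b[groups]]), rowvar=False)
R_tt, R_td, R_dd = R[:2, :2], R[:2, 2:], R[2:, 2:]

# Part of the dummy correlation matrix accounted for by regression on t1, t2.
regression_part = R_td.T @ np.linalg.solve(R_tt, R_td)

# Squared multiple correlation of each dummy with the two tests.
r_squared = np.diag(regression_part)

# What remains of the dummy correlations once the regression is removed.
residual = R_dd - regression_part
```

Here `R_dd` plays the part of the Before matrix and `residual` that of the After matrix tabled earlier in the chapter.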


It can be seen that these three squared multiple correlations are slices of
the overall regression. Whereas formerly we had two squared canonical
correlations (0.8881 and 0.3619) which summed to 1.2500, we now have
three squared correlations, summing to the same value, and each is
uniquely related to one a priori dimension. Furthermore, each multiple
regression exhausts the total available predictive variance of each
dummy. This becomes evident if we subtract the After matrix from the
Before matrix given above, to give us that part of the correlation matrix
of the dummy variables which is accounted for by regression. The formula
for the part of the correlation matrix attributable to regression is
$R_{dt} R_{tt}^{-1} R_{td}$, where the subscripts refer to the $t_i$ and
the $d_i$ variables.

          $d_1$      $d_2$      $d_3$
$d_1$     0.7321     0.3214    -0.0893
$d_2$     0.3214     0.2143     0.1071
$d_3$    -0.0893     0.1071     0.3036


Each diagonal element of this matrix is equal to the value of $R^2$ for the
regression of the corresponding dummy.

The analysis shows that, by introducing orthogonal a priori dimensions,
it is possible to partition the regression variance into independent
slices, just as the factor analyst reduces a correlation matrix by peeling
off layer after layer, each layer being accounted for by a single factor
or component (cf. Burt's, 1938, exposition of the properties of
'unit hierarchies'). A further consequence of the orthogonality of the
dummy variables is that it becomes possible to apply an identical
partitioning process to the $t_i$ variables. The two $t_i$ variables
correlate 0.8712. The part of their correlation matrix which is accounted
for by the regression on all three dummy variables in the canonical
correlation analysis is

          $t_1$      $t_2$
$t_1$     0.8824     0.7954
$t_2$     0.7954     0.8049


This is the matrix which is calculated by means of the formula
$R_{td} R_{dd}^{-1} R_{dt}$. We may apply the same formula to the
regression of the $t_i$ variables on each dummy variable in turn. The
resulting matrices, which sum to the above matrix, are

Dummy variable    Matrix

1                 0.7059    0.6818
                  0.6818    0.6585

2                 0.1765    0.1136
                  0.1136    0.0732

3                 0.0000    0.0000
                  0.0000    0.0732
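
Because the dummies are uncorrelated, each of these matrices is simply the outer product of one column of correlations between tests and dummy. The sketch below (NumPy assumed, pattern B dummies) rebuilds the three layers and checks that each has rank one and that they sum to the matrix for the full regression.

```python
import numpy as np

scores = np.array([
    [9, 10], [10, 8], [11, 12],
    [7, 10], [8, 8], [9, 12],
    [5, 6], [6, 4], [7, 8],
    [3, 2], [4, 0], [5, 4],
], dtype=float)
groups = np.repeat(np.arange(4), 3)
pattern_b = np.array([[-1, -1, -1], [-1, 1, 1], [1, -1, 1], [1, 1, -1]], dtype=float)

R = np.corrcoef(np.column_stack([scores, pattern_b[groups]]), rowvar=False)
R_td, R_dd = R[:2, 2:], R[2:, 2:]

# Regression of the tests on all three dummies at once.
full = R_td @ np.linalg.solve(R_dd, R_td.T)

# Regression of the tests on each dummy in turn: one layer per dummy.
layers = [np.outer(R_td[:, j], R_td[:, j]) for j in range(3)]

# Each layer should have exactly one non-zero latent root.
ranks = [int(np.linalg.matrix_rank(layer, tol=1e-10)) for layer in layers]
```

The additivity depends on the orthogonality of the dummies; with correlated dummies the separate layers would not sum to `full`.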


Each of these matrices is a unit hierarchy in the sense that it has one
and only one principal component. Examination of the third matrix
reveals that the variable $t_2$ is, to the extent of 7.32 per cent of its
variance, in specific association with the third a priori dimension.
Information of this nature may be very valuable in, for example, a genetic
study; and it is unwise to neglect an association simply because it
accounts for only a small part of the variance of a variable.

The percentage of the variance of a set of variables which is accounted
for by a particular regression may be calculated by summing the
elements in the principal diagonal of the appropriate matrix, multiplying
by 100, and dividing by the number of variables in the set. For example,
the percentage of the variance of the standardized $t_i$ variables which is
accounted for by the first a priori dimension is (0.7059 + 0.6585)100/2
= 68.22%. The percentage which is accounted for by all three dummy
variables in the canonical correlation analysis is (0.8824 + 0.8049)100/2
= 84.36%. The percentage of the variance of a single dummy variable
which is accounted for by a multiple regression analysis is obtained by
multiplying $R^2$ by 100. Thus we obtain the following table of
percentages:

Dummy variable    Percentage of $t_i$    Percentage of $d_i$
1                       68.22                  73.21
2                       12.48                  21.43
3                        3.66                  30.36

1, 2 and 3              84.36                  41.67


There are two traps in this table. One is obvious: the first three rows of
the $t_i$ column add to the fourth, while the first three rows of the $d_i$
column add to three times the value in the fourth. The discrepancy in
the $d_i$ column is a simple consequence of the fact that each $100R^2$ is
derived from a single variable, whereas the 41.67 per cent is an average
derived from three variables. No real problem arises from the
discrepancy. The second trap, however, is a consequence of the fact that,
when the analysis began, the dummy variables were orthogonal, whereas
the $t_i$ variables were not. The total of the $d_i$ variance accounted for
is 3 x 0.4167 = 1.25, which is the sum of the canonical roots and also
the sum of the $R^2$. A similar relation does not hold for the $t_i$
variables because they do not constitute an orthogonal set.
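
The whole table of percentages can be rebuilt from the correlations between the tests and the dummies. This is a sketch assuming NumPy and the dummies of pattern B; the variable names are illustrative.

```python
import numpy as np

scores = np.array([
    [9, 10], [10, 8], [11, 12],
    [7, 10], [8, 8], [9, 12],
    [5, 6], [6, 4], [7, 8],
    [3, 2], [4, 0], [5, 4],
], dtype=float)
groups = np.repeat(np.arange(4), 3)
pattern_b = np.array([[-1, -1, -1], [-1, 1, 1], [1, -1, 1], [1, 1, -1]], dtype=float)

R = np.corrcoef(np.column_stack([scores, pattern_b[groups]]), rowvar=False)
R_tt, R_td = R[:2, :2], R[:2, 2:]
n_tests, n_dummies = R_td.shape

# Percentage of the variance of the tests accounted for by each dummy:
# sum of the diagonal of that dummy's layer, times 100, over the number of tests.
pct_t = 100.0 * (R_td ** 2).sum(axis=0) / n_tests
pct_t_all = pct_t.sum()        # additive because the dummies are orthogonal

# Percentage of the variance of each dummy accounted for by the tests (100 R^2).
r_squared = np.diag(R_td.T @ np.linalg.solve(R_tt, R_td))
pct_d = 100.0 * r_squared
pct_d_all = pct_d.sum() / n_dummies   # an average over the three dummies
```

The last line makes the first trap explicit: the bottom entry of the dummy column is a mean of the three entries above it, not their sum.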

Although the dummy variables are orthogonal when the analysis
begins, their projections into the space of the $t_i$ variables are not
orthogonal. The correlation between the projections of two dummies may
be calculated from the elements of the matrix $R_{dt} R_{tt}^{-1} R_{td}$.
For example, the correlation between $d_1$ and $d_2$ is

$$\frac{0.3214}{\sqrt{0.7321} \times \sqrt{0.2143}} \approx 0.81$$
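
The same calculation can be applied to every pair of dummies at once by rescaling the regression matrix so that its diagonal is unity. A sketch, assuming NumPy and pattern B:

```python
import numpy as np

scores = np.array([
    [9, 10], [10, 8], [11, 12],
    [7, 10], [8, 8], [9, 12],
    [5, 6], [6, 4], [7, 8],
    [3, 2], [4, 0], [5, 4],
], dtype=float)
groups = np.repeat(np.arange(4), 3)
pattern_b = np.array([[-1, -1, -1], [-1, 1, 1], [1, -1, 1], [1, 1, -1]], dtype=float)

R = np.corrcoef(np.column_stack([scores, pattern_b[groups]]), rowvar=False)
R_tt, R_td = R[:2, :2], R[:2, 2:]

# Covariances of the projections of the dummies into the space of the tests.
P = R_td.T @ np.linalg.solve(R_tt, R_td)

# Rescale to correlations: divide each element by the root of the product
# of the two corresponding diagonal elements.
scale = np.sqrt(np.diag(P))
projection_correlations = P / np.outer(scale, scale)

r_12 = projection_correlations[0, 1]
```

The value of `r_12` is the quotient shown in the formula above, computed without intermediate rounding.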


Theoretical explanation and empirical relation 


At the beginning of the section 'Designed comparisons' two possible
schemes of dummy variables were suggested, and one was rejected in
favour of the other. It should be emphasized that the basis for choosing
one set rather than the other was supposed to be a theoretical and not
an empirical superiority. In fact, both sets of dummies exhaust all the
between groups degrees of freedom and both provide exactly the same
degree of fit to the data. The overall percentages of variance accounted
for are the same, and the identities which have been illustrated for one
set hold for the other. The only empirical difference between the two
sets lies in the distribution of weights among the three dummies. Indeed,
since $d_2$ of pattern A, the second dummy of the set we did not use, is
identical (apart from sign) with $d_3$ of pattern B, the third dummy of
the set we did use, the only difference lies in the distribution of weights
between the remaining two dummies in each set. Whereas our illustrative
analysis allotted a weight of 0.7321 to $d_1$ and 0.2143 to $d_2$, the
alternative analysis would give a weight of 0.8857 to $d_1$ and 0.0607 to
$d_3$. If the experimenter is eager to simplify the empirical relations
between groups at the cost of omitting one degree of freedom, then the
alternative analysis would give a better fit, in the sense that less
between groups variance would be lost by the omission of $d_3$ of pattern
A than by the omission of $d_2$ of pattern B.

The results of an analysis by canonical correlations can be broken
down by calculating the regression of each variable in a set (let us call
it the set of $y$ variables) on all the variables of the other set (the $x$
variables). Even if the $y$ variables are not orthogonal, the sum of the
$R^2$ is equal to the number of $y$ variables multiplied by the proportion
of the variance of the $y$ variables which is accounted for by regression
in the canonical correlation analysis. The remaining identities, however,
no longer hold when the $y$ variables are correlated. In particular, the
matrix of correlations among the $x$ variables which is attributable to
regression does not split into additive unit hierarchies, the canonical
roots no longer add to the sum of $R^2$, and the percentages of the $x$
variables accounted for by the multiple regression analyses do not sum
to the percentage accounted for by the canonical correlation analysis.


Formal equivalence of methods 


The equivalence of the methods of analysis of dispersion, canonical
analysis of discriminance and canonical correlation analysis may be
summed up by describing the rationale of the methods both in the terms
of analysis of variance and in the terms of regression analysis. In an
analysis of variance the total variance is split up into variance which
can be accounted for by a hypothesis and variance which is error
variance or, to employ a word used by information theorists, noise. The
variance attributable to the hypothesis is equivalent to the variance due
to regression, and is the between groups variance. The within group
variance, which represents error, is equivalent to variance due to
deviation from regression. Just as we split total variance and covariance
into between and within variances and covariances, so we split the overall
correlation matrix $R_{xx}$ into that part which can be accounted for by
regression, $R_{xy} R_{yy}^{-1} R_{yx}$, and that part which represents
deviation from regression, $R_{xx} - R_{xy} R_{yy}^{-1} R_{yx}$. The
determinantal equation of canonical correlation analysis,

$$|R_{xy} R_{yy}^{-1} R_{yx} - \mu R_{xx}| = 0$$

is equivalent to one of the formulae of canonical analysis of
discriminance,

$$|B - \mu T| = 0$$


The practical advantage of regarding analysis of variance as a
regression method is that it becomes easy to see how to lay down a priori
regression lines by the simple expedient of defining dummy variables,
thus partitioning the between groups variance into additive slices. It is
possible to apply the canonical analysis of discriminance and canonical
correlation analysis to either raw or standardized scores. For many
purposes an adequate assay of a body of data involves the application
of discriminant analysis to the raw scores and canonical correlation
analysis to the standardized scores, followed by an exploration of the
regression space under the guidance of a hypothesized structure.
Chapter 9 Analysis of Covariance


The analysis of covariance is the most complicated of the standard. 
parametric statistical methods. It is complicated because it involves the 
simultaneous employment of the concepts of analysis of variance and 
regression analysis. Its conceptual complexity is compounded by the 
arithmetic jungle which sprouts on the pages of textbooks when data 
on several variables are analysed without use of matrices.

Analysis of covariance resembles analysis by canonical correlations in 
that it deals with the regression of one set of variables (y, say) on another
(x). There are two major differences between the two techniques. In 
canonical correlation analysis the x and y variables are interdependent 
and have the same logical status. The analysis may be reversed without 
any essential change in its structure. In analysis of covariance, on the 
other hand, the x variables are covariance variables and the y variables 
are dependent. The second major difference between the two methods 
is that, like an analysis of variance, the analysis of covariance incor- 
porates one or more treatment or comparison variables, which are 
represented by the allocation of persons to two or more groups. In a 
canonical correlation analysis every person is treated as if he belonged 
to one and the same group. 

The general case of canonical correlation analysis is one in which
there is a plurality of x variables and a plurality of y variables. A
multiple regression analysis is a special case in which one set of variables
has only one member, and a simple regression analysis is a special case in
which both sets have only one member. The same distinct cases occur in
analysis of covariance. It is convenient, however, to distinguish only two
cases, the multivariate case in which either or both of the two sets may be
plural, and the bivariate case in which there is only one x variable and
one y variable. The equations of the method must be interpreted in
matrix or vector terms for the multivariate case, and in scalar terms
for the bivariate case.

Analysis of covariance is more complex than canonical correlation 
analysis because it includes a third type of variable, over and above the 
independent x variables and the dependent y variables. This third type 
is the treatment, grouping, or designed comparison variable, in the 
analysis of variance sense. Instead of all persons belonging to a single 
group, as did the psychotherapist’s patients at the beginning of chapter 
6, they have been assigned to two or more groups. In the chapter on
univariate analysis of variance we saw how an individual's score on a
variable (in the sense of his deviation from the overall mean of all
persons on that variable) may be broken down into a number of elements
or terms. Take, for example, the deviation score of the second person
in the first group on page 25 and call it $y_{12} = -1$. This score is made
up of two parts,* a treatment or comparison element for the first group,
$c_1 = -2$, and an error element for that person, $e_{12} = +1$,

$$y_{12} = c_1 + e_{12}$$

The equation is in principle the same when more than one factor enters 
into the design of the experiment. In a two-factor design such as that 
on pages 33 and 34 the treatment effect breaks down into three com- 
parisons, one for sex, one for constitution, and one for the interaction 
of sex and constitution. Values of the terms of the equation for the 
sex × constitution analysis may be calculated by analysing the group
means exactly as the person × test × occasion data were analysed,
further on in the same chapter. The means of the groups are 


                     Constitution
                  Normal   Abnormal
Sex    Men           5         1
       Women         1         3

Expressed as deviations from the overall mean 2½, these group means
are

                  Normal   Abnormal    Mean
       Men         +2½       −1½       +½
       Women       −1½       +½        −½
       Mean        +½        −½         0


* The equations or models of analysis of variance and analysis of covariance
are made up of additive elements or components which represent true or
population values (parameters). The equations given in the text represent
simple algebraic identities which are used to break down the sample scores
into additive parts. Readers who are puzzled by the mathematical formulae
of the models will probably be helped to appreciate their properties by
dismembering an actual score. But they are warned that there is no necessary
one-to-one relationship between an element of a score and a term in the model.
The nature of the relation depends on the assumptions which are satisfied by
the experimental design. Eisenhart (1947) gives a very clear exposition of the
possible relations, and Plackett (1960) indicates the modifications to earlier
theory which resulted from experience in the application of models.

The value of the Sex comparison for men may be written $s_m = +\tfrac{1}{2}$,
where s stands for sex and m stands for men. The value of the same
comparison for women is $s_w = -\tfrac{1}{2}$. Similarly, the Constitution
comparison takes the two values $c_n = +\tfrac{1}{2}$ and $c_a = -\tfrac{1}{2}$.
Subtracting these values from the appropriate cells of the table of
deviations leaves us with the table of interaction elements,

       +1½    −1½
       −1½    +1½

which may be symbolized

       $(s_m c_n)$    $(s_m c_a)$
       $(s_w c_n)$    $(s_w c_a)$


Now let us take the score of a particular individual and express it as
a deviation from the overall mean of 2½. The deviation of the first
abnormal male, which we shall call $y_{ma}$, is 0 − 2½ = −2½. The error
component of this score is the deviation of the score from the mean of
the abnormal males, $e_{ma}$ = 0 − 1 = −1. Now we may build up the
analysis of variance equation for this man as follows:

$$y_{ma} = s_m + c_a + (s_m c_a) + e_{ma}$$
$$-2\tfrac{1}{2} = (+\tfrac{1}{2}) + (-\tfrac{1}{2}) + (-1\tfrac{1}{2}) + (-1)$$
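
The dismemberment of this score can be checked mechanically. The following is a minimal sketch, not part of the original exposition: it assumes equal numbers of persons in the four cells and takes the cell means as 5, 1, 1 and 3 (the value 5 for normal men is inferred from the marginal values quoted in the text). All names in the code are invented for the illustration.

```python
# Cell means for the sex x constitution design; equal cell sizes are assumed.
# The value 5 for normal men is inferred from the margins quoted in the text.
cell_means = {
    ("men", "normal"): 5.0,
    ("men", "abnormal"): 1.0,
    ("women", "normal"): 1.0,
    ("women", "abnormal"): 3.0,
}
grand_mean = sum(cell_means.values()) / len(cell_means)


def marginal_effect(level, position):
    """Mean of the cells at one level of a factor, as a deviation from the grand mean."""
    cells = [m for key, m in cell_means.items() if key[position] == level]
    return sum(cells) / len(cells) - grand_mean


s_m = marginal_effect("men", 0)            # sex comparison for men
c_a = marginal_effect("abnormal", 1)       # constitution comparison for abnormals
# Interaction element: what is left of the cell deviation after both comparisons.
sc_ma = cell_means[("men", "abnormal")] - grand_mean - s_m - c_a

score = 0.0                                # raw score of the first abnormal male
e_ma = score - cell_means[("men", "abnormal")]   # error element
y_ma = score - grand_mean                        # deviation from the overall mean

print(s_m, c_a, sc_ma, e_ma)
print(y_ma == s_m + c_a + sc_ma + e_ma)
```

The four elements add up to the deviation score exactly, which is all that the algebraic identity claims.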


The equation can be made as complicated as we wish by the intro- 
duction of further factors or designed comparisons, but basically the 
right-hand side consists of only two parts. The first is the between 
groups part which is attributable to treatments or comparisons, and the 
second is the within group part which is attributable to error. Analysis 
of covariance introduces a third type of element into the equation,
namely the element of regression on an independent variable or set of
variables x. The score of the jth person in the ith group is composed of
three parts

$$y_{ij} = c_i + bx_{ij} + e_{ij}$$

Here $x_{ij}$ is the score of the person in question on an independent
variable x, expressed as a deviation from the overall mean of x. If there
is more than one independent variable, the term $x_{ij}$ stands for a vector
of deviation scores. The regression coefficient b is the error or within
group regression of y on x, and may be written more explicitly $b_{yx}$. In
the multivariate case b is replaced by a y × x matrix of regression
coefficients, where y is the number of y variables and x is the number of
x variables. In the multivariate case the terms $c_i$ and $e_{ij}$ also become
vectors.


Two uses of analysis of covariance 


Analysis of covariance is a powerful, and often a dangerous and mis- 
leading, technique. There are almost as many types of analysis as there 
are experiments in which it is employed; this means that the research 
worker who uses the method must think hard about what he is doing. 
Nevertheless, among all the variants, we may distinguish two modes of 
employment, which may be called the experimental and the observa- 
tional. 

In the experimental employment of analysis of covariance, persons 
are allotted to treatment groups at random, and the independent vari- 
able x cannot be affected by treatment. One way of ensuring that the x 
variable or variables are uncorrelated with the treatment effects is to 
measure them before treatment has taken place, and even before persons 
have been assigned to groups. Although the x variables are not cor- 
related with the designed comparisons, they may be correlated with the 
y variables. In so far as an independent variable and a dependent vari- 
able are correlated within groups, it is possible to increase the sensitivity 
or precision of the experiment by removing the effect of the x variable, 
which would otherwise inflate the estimate of error. In effect, this 
amounts to performing an analysis of variance on the y variables after 
covariance adjustment for the x variables, that is, to an analysis of 
variance of $y_{ij} - bx_{ij}$. The analysis is, in some cases, a sophisticated
version of an analysis of difference or change scores. Provided that 
initial score x is moderately or highly correlated with final score y, the 
difference between the two has a lower error variance than either of 
them taken separately; and so the test of significance is more sensitive 
than an analysis of variance applied to y alone. 

In the observational employment of analysis of covariance, allotment 
of persons to treatment groups cannot be random because the ‘treatment’
is some given characteristic, such as presence or absence of
psychosis. In consequence, the x variables may be correlated with the
treatment variables. The purpose of analysis of covariance in this situa- 
tion differs from its purpose in the experimental, randomized, situation. 
In the experimental type of analysis, the hypothesis tested by analysis 
of covariance is identical with the hypothesis tested by analysis of 
variance. The only difference lies in the sensitivity of the test. In the 
observational situation the two hypotheses differ in so far as a com- 
parison after covariance adjustment is not the same as the comparison 
before covariance adjustment. Typically, the effect of covariance
adjustment in this situation is the removal of a certain part, aspect, or
correlate of both y and the treatment effect. The analysis then tests the
extent of the relation between the adjusted y and the adjusted comparisons.

A third situation may be mentioned. This resembles the experimental 
situation in that allotment of persons to treatment groups is random. 
It differs from the pure experimental situation, however, in that it is 
possible for the treatment to affect the x variables. Covariance
adjustment of the y variables for the x variables then usually implies that
the research worker is looking to see how far the effect of the treatment
on y is mediated by its effect on x.


Example 


The following arithmetic example is a bivariate case, in the sense that
it contains only one independent variable and one dependent variable.
It is a special case of multivariate analysis, and the multivariate
formulae will be presented at appropriate points in the exposition. There
are only two groups and so the equation is the simple one with a single
comparison,

$$y_{ij} = c_i + bx_{ij} + e_{ij}$$

Twelve boys have been allotted at random to two methods of teaching
reading, the old method and the new. Common sense tells us to expect
intelligent boys to learn better, whatever the method of teaching. Since
we have used random assignment, we do not expect there to be any
significant difference in the mean intelligence of the groups. (In fact, in
this artificial example, the two group means on intelligence are identical.)
The analysis is therefore of the experimental, rather than the
observational type. The extent of the difference between the two methods
is to be judged from the following scores.

                 Old method                      New method
        Intelligence(x)  Reading(y)     Intelligence(x)  Reading(y)
              4              4                2              6
              2              2                3              4
              3              0                2              6
              2              2                4              8
              3              0                3              4
              4              4                4              8
Mean          3              2                3              6

The sums of squares and sums of products within each group are
calculated in the usual way.


        $W_{old}$               $W_{new}$
          x     y                 x     y
     x    4     4            x    4     4
     y    4    16            y    4    16

Common sense told us that intelligent boys would learn better, what- 
ever the method of teaching. Let us elaborate this assumption by further 
assuming that the extent of the association between reading attainment 
and intelligence is the same for both methods of teaching. In other 
words, we expect the regression of reading score on intelligence to be 
the same in both groups. Both these assumptions are testable, and 
methods for testing them are given in the two following sections. Here 
it is enough to observe that the two within group sums of squares and 
sums of products matrices are identical, and so there is no reason to 
doubt that both sample regressions are estimates of a common population
regression. We may therefore add the two matrices to obtain the
combined within group matrix W,

          x     y
     x    8     8
     y    8    32


The estimated regression of y on x, which may be symbolized $b_{yx}$, is
calculated from W. In the multivariate case W is a partitioned matrix
like R in chapter 6,

$$W = \begin{bmatrix} W_{xx} & W_{xy} \\ W_{yx} & W_{yy} \end{bmatrix}$$

The regression of the y variables on the x variables is then the matrix
$B_{yx}$:

$$B_{yx} = W_{yx}W_{xx}^{-1}$$

In the ordinary bivariate case of our example, this matrix formula
reduces to

$$b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{8}{8} = 1$$

This coefficient is the slope of the regression of reading score on
intelligence score within groups. It tells us the number of extra units of 
reading score which follow upon a rise of one unit in intelligence score, 
when all individuals have received the same treatment. In this par- 
ticularly simple example a unit increment in intelligence leads us to 
predict an increase of b = 1 in reading score.
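
The pooling of the two groups and the calculation of the slope may be sketched as follows. This is an illustrative sketch under the same reading of the data as before (third old method boy scoring 3 on intelligence); the names are invented.

```python
OLD = [(4, 4), (2, 2), (3, 0), (2, 2), (3, 0), (4, 4)]
NEW = [(2, 6), (3, 4), (2, 6), (4, 8), (3, 4), (4, 8)]


def centred(pairs):
    """Deviations of each (x, y) pair from the means of its own group."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return [(x - mean_x, y - mean_y) for x, y in pairs]


# Pooling the within group deviations is the same as adding the two matrices.
within = centred(OLD) + centred(NEW)
w_xx = sum(dx * dx for dx, _ in within)
w_xy = sum(dx * dy for dx, dy in within)
w_yy = sum(dy * dy for _, dy in within)

b_yx = w_xy / w_xx   # bivariate form of W_yx W_xx^-1
print(w_xx, w_xy, w_yy, b_yx)
```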

In analysis of covariance, as in analysis of variance, the base-line of 
a variable is its overall mean. The two overall means are 3 for
intelligence and 4 for reading. We express every score as a deviation from
its overall mean,


                 Old method                      New method
        Intelligence(x)  Reading(y)     Intelligence(x)  Reading(y)
             +1              0               −1             +2
             −1             −2                0              0
              0             −4               −1             +2
             −1             −2               +1             +4
              0             −4                0              0
             +1              0               +1             +4
Mean          0             −2                0             +2


We now calculate the reading scores adjusted for intelligence; that is, 
for each boy we calculate y — bx. 


        Old method    New method
            −1            +3
            −1             0
            −4            +3
            −1            +3
            −4             0
            −1            +3
Mean        −2            +2
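
The adjusted scores can be reproduced with a few lines of code. The sketch below is illustrative; it takes b = 1 from the text and uses the data as read above, with invented names.

```python
OLD = [(4, 4), (2, 2), (3, 0), (2, 2), (3, 0), (4, 4)]
NEW = [(2, 6), (3, 4), (2, 6), (4, 8), (3, 4), (4, 8)]
B = 1.0   # within group regression of reading on intelligence, from the text

everyone = OLD + NEW
mean_x = sum(x for x, _ in everyone) / len(everyone)   # overall mean of x
mean_y = sum(y for _, y in everyone) / len(everyone)   # overall mean of y


def adjusted(pairs):
    """y - bx for each boy, with x and y expressed as deviations from overall means."""
    return [(y - mean_y) - B * (x - mean_x) for x, y in pairs]


adj_old = adjusted(OLD)
adj_new = adjusted(NEW)
print(adj_old, sum(adj_old) / len(adj_old))
print(adj_new, sum(adj_new) / len(adj_new))
```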


When reporting the unadjusted and the adjusted means of the groups
on the y variable, it is customary to add the overall mean on y to every
group mean. Adding 4 to each group mean gives us the means,

              Old method   New method
Unadjusted         2            6
Adjusted           2            6


What has been gained by the covariance adjustment? The means of the 
groups are, in this example, exactly the same after adjustment as they 
were before it. This follows from the fact that the two groups have 
exactly the same mean on x. In a typical experimental situation the 
group means on x will not be identical, but they will not be greatly 
different; and so the means of the y variables will be only slightly dif- 
ferent after covariance adjustment from what they were before it. What 
then has been achieved? The answer is that the precision of the experi- 
ment has been to some extent improved by the adjustment, in the sense 
that the within group or error sum of squares has been reduced. To 
show this, we apply analysis of variance calculations to both the un- 
adjusted and the adjusted reading scores. Only one slight modification 
is required, and that is the omission of one degree of freedom from the 
error row of the adjusted analysis to allow for the fact that b has been 
estimated from the within group matrix. 

The analyses may be set out side by side and F ratios may be calculated
from each in the usual way. In this particular example the analysis
of variance of the adjusted scores is identical with the analysis of
covariance. But the reader is warned that this identity holds only because
both groups have the same mean score on x. If, as almost invariably
happens with real data, the group means are not identical, the between
groups mean square in the analysis of adjusted scores over-estimates
the numerator of F in the analysis of covariance. For this reason the
between or comparison sum of squares and mean square are enclosed
in brackets, although in this case they happen to be accurate. The first
analysis is of y and the second is of y − bx.


                              Unadjusted              Adjusted
Source                    d.f.    SS     MS       d.f.    SS      MS
Comparison (Between)        1     48     48         1    (48)    (48)
Error (Within)             10     32     3.2        9     24     2.67
Sum (Total)                11     80               10     72


Although the two analyses derive their sums of squares in the usual way
from the three sources between, within and total, these sources have
been given new names. These names help to clarify the structure of tests
of significance of designed comparisons in analysis of covariance.

It can be seen that adjustment for x decreases the error variance of y
from 3.2 to 2.67. In consequence, the variance ratio increases from
$F_{1,10} = 15$ to $F_{1,9} = 18$. The effect of the increase is to some extent
offset by the loss of a degree of freedom in the calculation of b from
the data.
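
Both variance ratios can be recovered from the raw scores. The following sketch is an illustration with invented names, using the data as read above and b = 1. It relies on the shortcut of analysing y − bx, which, as the text warns, agrees with the analysis of covariance only because the two groups have the same mean on x.

```python
OLD = [(4, 4), (2, 2), (3, 0), (2, 2), (3, 0), (4, 4)]
NEW = [(2, 6), (3, 4), (2, 6), (4, 8), (3, 4), (4, 8)]
B = 1.0   # within group slope of reading on intelligence, from the text


def one_way(groups):
    """Between groups and within group sums of squares for lists of scores."""
    scores = [s for g in groups for s in g]
    grand = sum(scores) / len(scores)
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    within = sum((s - m) ** 2 for g, m in zip(groups, means) for s in g)
    return between, within


n_persons, n_groups = len(OLD) + len(NEW), 2
raw = [[y for _, y in OLD], [y for _, y in NEW]]
adjusted = [[y - B * x for x, y in OLD], [y - B * x for x, y in NEW]]

between_raw, within_raw = one_way(raw)
between_adj, within_adj = one_way(adjusted)

f_unadjusted = between_raw / (within_raw / (n_persons - n_groups))
# One further degree of freedom is lost because b was estimated from the data.
f_adjusted = between_adj / (within_adj / (n_persons - n_groups - 1))
print(round(f_unadjusted, 2), round(f_adjusted, 2))
```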

The example illustrates the experimental employment of analysis of 
covariance for the purpose of increasing precision. The advantage of 
increased precision is that a particular treatment effect may be demon- 
strated with a smaller experiment, and hence with less cost and labour 
in collecting data. On the other hand, the example also illustrates that 
the benefits of covariance adjustment are quite small when the within 
group correlation between the x and y variables is low or moderate. The 
correlation between intelligence and reading, calculated from W, is 


$$r = \frac{8}{\sqrt{8}\,\sqrt{32}} = 0.5$$

Higher correlations than this are desirable if analysis of covariance is to 
be of substantial benefit. The psychologist and the social scientist very 
often do not achieve such correlations, because their measures are un- 
reliable. If the research worker is wondering whether analysis of co- 
variance is worth while, he may calculate the approximate value of the 
adjusted error mean square if he knows the unadjusted mean square 


($s_y^2 = 3.2$) and the within group correlation of x and y (r = 0.5). The
formula is

$$s_{y.x}^2 = s_y^2\,(1 - r^2)\left(1 + \frac{1}{f - 2}\right)$$

where f = 10 is the error degrees of freedom. In our example this yields

$$3.2\,(1 - 0.25)\left(1 + \frac{1}{8}\right) = 2.7$$

which is quite close to the adjusted mean square of 2.67.
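
A research worker planning an experiment might wrap this approximation in a small function. The sketch below is illustrative only and rests on an assumption about the form of the formula, namely $s_y^2(1 - r^2)(1 + 1/(f - 2))$, which is the form that reproduces the value 2.7 quoted in the text; the function name is invented.

```python
def approx_adjusted_mean_square(ms_unadjusted, r_within, error_df):
    """Approximate error mean square after covariance adjustment.

    Assumed form: s^2 (1 - r^2) (1 + 1 / (f - 2)), with f the error d.f.
    """
    return ms_unadjusted * (1 - r_within ** 2) * (1 + 1 / (error_df - 2))


approx = approx_adjusted_mean_square(3.2, 0.5, 10)
print(round(approx, 2))

# The gain shrinks rapidly as the within group correlation falls.
for r in (0.3, 0.5, 0.7, 0.9):
    print(r, round(approx_adjusted_mean_square(3.2, r, 10), 2))
```

Running the loop for several values of r gives a quick impression of how high the within group correlation must be before the adjustment is worth the trouble.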

The data of the example have been subjected to adjustment followed 
by analysis of variance in order to demonstrate the nature of the analysis 
of covariance. Now we shall re-analyse the example following the usual 
textbook rules for performing an analysis of covariance. The sums of 
squares and sums of products may be set out in rows as follows: 


Source            d.f.   Σx²   Σxy   Σy²   $b_{yx}$
Old method          5     4     4    16      1
New method          5     4     4    16      1
Error (Within)     10     8     8    32      1
Sum (Total)        11     8     8    80


In chapter 6 it was stated that, in the case of a partitioned correlation
matrix, the portion of the matrix $R_{yy}$ which is accounted for by the x
variables is

$$R_{yx}R_{xx}^{-1}R_{xy}$$

and the deviation from regression is

$$R_{yy} - R_{yx}R_{xx}^{-1}R_{xy}$$

Similarly, in the case of a sums of squares and sums of products matrix
such as W, the portion of $W_{yy}$ which is accounted for by the x variables
is

$$W_{yx}W_{xx}^{-1}W_{xy} = B_{yx}W_{xy}$$

and the deviation from regression is

$$W_{y.x} = W_{yy} - W_{yx}W_{xx}^{-1}W_{xy}$$


In the bivariate case the sum of squares for regression is

$$\frac{(\sum xy)^2}{\sum x^2}$$

and this quantity, subtracted from $\sum y^2$, gives the deviation sum of
squares, that is, the sum of squares of that part of the y variable which
cannot be accounted for by x. The deviation sum of squares for error
is the within group sum of squares of y − bx, which we have already
calculated. The formula for calculating a deviation sum of squares may
be applied to any row of the preceding table. For the purpose of testing
significance, it must be applied to both the error and the sum rows, thus
(the Comparison row contains the elements of the between groups
matrix):
Source            d.f.   Σx²   Σxy   Σy²   (Σxy)²/Σx²   Deviation SS   d.f.    MS
Comparison          1     0     0    48
Error              10     8     8    32        8             24          9    2.67
Sum                11     8     8    80        8             72         10
Adjusted means                                               48          1    48

The sum of squares and degrees of freedom in the last row are obtained
from the two preceding rows by subtraction. The variance ratio is
obtained from the last two columns and is $F_{1,9} = 48/2.67 = 18.00$.
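
The textbook routine lends itself to a short program. The sketch below is an illustration with invented names; it starts from the error and sum rows of the table and recovers the adjusted means row by subtraction.

```python
# Each row: (d.f., sum of x squared, sum of xy, sum of y squared), from the table.
ROWS = {
    "error": (10, 8.0, 8.0, 32.0),
    "sum": (11, 8.0, 8.0, 80.0),
}


def deviation(df, sxx, sxy, syy):
    """Deviation sum of squares and its d.f. (one d.f. is lost to the regression)."""
    return syy - sxy ** 2 / sxx, df - 1


dev_error, df_error = deviation(*ROWS["error"])
dev_sum, df_sum = deviation(*ROWS["sum"])

# Adjusted means row: sum row minus error row.
ss_adjusted = dev_sum - dev_error
df_adjusted = df_sum - df_error

variance_ratio = (ss_adjusted / df_adjusted) / (dev_error / df_error)
print(dev_error, dev_sum, ss_adjusted, df_adjusted, round(variance_ratio, 2))
```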
In the multivariate case the column headed Deviation SS contains
determinants of two deviation matrices, one in the error row and the
other in the sum row. The adjusted means row is omitted. The error
deviation matrix is the matrix $W_{y.x}$. The sum deviation matrix is

$$S_{y.x} = S_{yy} - S_{yx}S_{xx}^{-1}S_{xy}$$

If C is the sums of squares and sums of products matrix in the
comparison row, and W is the matrix in the error row, then we may write
S = C + W. S is partitioned in the same way as W and R to give the
submatrices $S_{xx}$, etc. The L-criterion for the test of significance is the
ratio

$$L = \frac{|W_{y.x}|}{|S_{y.x}|}$$

and may be tested by the application of Rao’s (1952, p. 261) approximate
F test. Alternatively, chi square may be calculated as

$$\chi^2 = -\left[p - x - \tfrac{1}{2}(y + q + 1)\right]\log_e L, \qquad \text{d.f.} = yq$$

where x is the number of covariance variables, y is the number of
dependent variables, p is the degrees of freedom of the sum matrix S,
and q is the degrees of freedom of the comparison matrix C.
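
As a rough illustration, the L-criterion can be evaluated for the bivariate example, where each determinant is simply a deviation sum of squares. The sketch assumes that the chi square multiplier has Bartlett's usual form, p − x − ½(y + q + 1), with yq degrees of freedom; that reading of the formula is an assumption, and the variable names are invented.

```python
import math

# In the bivariate example the deviation matrices are 1 x 1, so their
# determinants are the deviation sums of squares of the error and sum rows.
det_error_deviation = 24.0
det_sum_deviation = 72.0
L = det_error_deviation / det_sum_deviation

p = 11   # degrees of freedom of the sum matrix S
q = 1    # degrees of freedom of the comparison matrix C
x = 1    # number of covariance variables
y = 1    # number of dependent variables

# Assumed form of the multiplier: p - x - (y + q + 1) / 2.
chi_square = -(p - x - 0.5 * (y + q + 1)) * math.log(L)
degrees_of_freedom = y * q
print(round(L, 4), round(chi_square, 2), degrees_of_freedom)
```

With a single dependent variable the exact F test of the preceding table is of course to be preferred; the chi square approximation earns its keep only when y is greater than one.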


Designed comparisons 


The purpose of changing the titles of the sources of variance was to 
remind the reader that, whatever the comparison to be tested, the sums 
of squares and sums of products matrix C must be added to an error 
matrix in order to obtain a sum matrix S. If there is more than one 
possible source of between groups variance—as there would be, for 
example, in a factorial design—all the sources may be tested for signi- 
ficance by following the procedure for each source in turn. It is not, of 
course, necessary to test every possible comparison. Sometimes a factor 
is introduced, not because the experimenter is interested in its effects, 
but because he wishes to eliminate those effects from the error variance. 
Sex is often introduced as a factor for this purpose. The experimenter 
suspects that men and women may differ in their mean y, and so he 
isolates this difference but does not analyse it further. If an educational 
experiment is carried out in several schools, the differences among the 
schools may be similarly excluded from the analysis. 

In a factorial design, an overall test of all the factors is made by 
treating the between groups sums of squares and sums of products 
matrix exactly as C was treated in the example. The C matrix for a 
particular factor may be calculated by applying an appropriate dummy 
weighting vector to the group means and calculating sums of squares
and sums of products. Having obtained matrix C for a comparison, the
experimenter must decide whether to estimate his error row from the 
within group matrix, as in the example, or from an interaction matrix. 
The decision depends on the model to which the experiment conforms. 
The principles which hold for models in the analysis of variance hold 
also for analysis of covariance. Readers who are unfamiliar with the 
two main models should consult Eisenhart (1947), Snedecor (1956), and 
Plackett (1960). Snedecor may also be consulted for actual examples of 
the calculation of C by the application of weighting vectors. His method 
is to apply the weights to sums rather than means and then to apply an 
appropriate correction. 

In assessing the significance of a particular comparison C, it may be 
valuable to assess also the significance of B — C, that is, the significance 
of the part of the non-error variance which the chosen comparison does 
not account for. This may be accomplished by testing B — C in exactly 
the same way as C. We calculate S = W + (B — C) which, since 
T = W + B, is equivalent to S = T — C. The L-criterion is again the 
ratio of $|W_{y.x}|$ to $|S_{y.x}|$. If a certain factor has been introduced purely
as a means of reducing error variance, the experimenter will not want 
it to inflate the test of whether C adequately accounts for the differences 
among the groups. In this case the sums of squares and sums of products 
matrix for the irrelevant factor should be subtracted from T before S 
is calculated. A corresponding correction is made to the degrees of 
freedom p which enter into the multiplier of chi square. 

A designed comparison between the means of only two groups, let us 
say groups j and k, may be calculated by employing a weighting vector 
with a positive weight for j, a negative weight for k, and zero weights 


for all other groups. 
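
A minimal sketch of the weighting vector calculation follows. It is an illustration rather than the method of any of the authors cited: it assumes equal numbers in every group and uses the ordinary contrast formula, in which the sums of squares and products of the weighted means are multiplied by n and divided by the sum of the squared weights. The function name and the choice of weights −1 and +1 are assumptions.

```python
def comparison_matrix(weights, group_means, n_per_group):
    """Sums of squares and products for one designed comparison.

    weights      one weight per group (they should sum to zero)
    group_means  one tuple of variable means per group
    n_per_group  number of persons in each group (equal groups assumed)
    """
    scale = n_per_group / sum(w * w for w in weights)
    n_vars = len(group_means[0])
    contrast = [
        sum(w * m[v] for w, m in zip(weights, group_means)) for v in range(n_vars)
    ]
    return [[scale * a * b for b in contrast] for a in contrast]


# Means on (intelligence, reading) for the old and new method groups.
group_means = [(3.0, 2.0), (3.0, 6.0)]
C = comparison_matrix([-1, 1], group_means, 6)
print(C)
```

Applied to the two groups of the example, the function returns the Comparison row of the analysis of covariance table: zero for x and for the product, and 48 for y.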


Test of whether the population regression differs from zero 


In making covariance adjustments, a knowledge of the size of the within
group correlation is more important than an assessment of its signifi- 
cance. Nevertheless, it is sometimes useful to know whether the slope of 
the error regression differs significantly from zero. The test of the null
hypothesis $b_{yx} = 0$ is made as follows:


Source                         d.f.    SS     MS
Error of unadjusted means       10     32
Error of adjusted means          9     24    2.67
Regression coefficient           1      8    8.00

$$F_{1,9} = \frac{8.00}{2.67} = 3.00$$

The degrees of freedom and sum of squares in the last row are obtained
by subtraction, and the variance ratio is constructed in the usual way.
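
The same test in code, as an illustrative sketch with invented names. It also confirms that the sum of squares found by subtraction is the regression sum of squares (Σxy)²/Σx² of the error row.

```python
w_xx, w_xy, w_yy = 8.0, 8.0, 32.0   # error (within group) row of the table
error_df = 10

ss_unadjusted = w_yy
ss_adjusted = w_yy - w_xy ** 2 / w_xx        # deviation from regression
ss_regression = ss_unadjusted - ss_adjusted  # obtained by subtraction
df_adjusted = error_df - 1
df_regression = error_df - df_adjusted

variance_ratio = (ss_regression / df_regression) / (ss_adjusted / df_adjusted)
print(ss_regression, df_regression, round(variance_ratio, 2))
```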
Test of whether the group regressions are estimates of a common
population regression


In analysis of covariance we estimate the slope of the regression line
$b_{yx}$ from a weighted average of the regressions within the samples. It
may be desirable to test the implied assumption that all the sample
regressions are estimating the same population regression; in other
words, that treatment does not affect regression. The test involves
calculating the sum of squares for deviation from regression

$$\sum y^2 - \frac{(\sum xy)^2}{\sum x^2}$$

for each group separately. These deviation SS are pooled, and the
pooled value is subtracted from the deviation SS within groups. The
degrees of freedom are similarly pooled and subtracted.


Source                 d.f.   Σx²   Σxy   Σy²   Deviation SS   d.f.    MS
Old method               5     4     4    16        12           4
New method               5     4     4    16        12           4
Pooled                                              24           8    3.00
Error                   10     8     8    32        24           9
Differences among
  regressions                                        0           1    0


The variance ratio is zero, because the regression coefficients within the 
two groups are identical. 
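
The pooling and subtraction can be sketched as follows. The code is illustrative, with invented names, and takes its figures from the rows of the table.

```python
# Each row: (d.f., sum of x squared, sum of xy, sum of y squared).
GROUPS = {
    "old method": (5, 4.0, 4.0, 16.0),
    "new method": (5, 4.0, 4.0, 16.0),
}
ERROR = (10, 8.0, 8.0, 32.0)


def deviation(df, sxx, sxy, syy):
    """Deviation sum of squares about a regression line, and its d.f."""
    return syy - sxy ** 2 / sxx, df - 1


# Separate regression line for each group, then pool.
per_group = [deviation(*row) for row in GROUPS.values()]
pooled_ss = sum(ss for ss, _ in per_group)
pooled_df = sum(df for _, df in per_group)

# Single common regression line fitted within groups.
common_ss, common_df = deviation(*ERROR)

difference_ss = common_ss - pooled_ss
difference_df = common_df - pooled_df
variance_ratio = (difference_ss / difference_df) / (pooled_ss / pooled_df)
print(pooled_ss, pooled_df, difference_ss, difference_df, variance_ratio)
```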


Interpretation 


It should be evident that the interpretation of an experimental analysis 
in which the treatments cannot affect the covariance variables is no 
more difficult than the interpretation of a straightforward analysis of 
variance of the y scores. Both the analysis of variance and the analysis 
of covariance test the same hypothesis. The difference lies in the
efficiency with which the hypothesis is tested. This is an instance of the
simplicity of dealing with uncorrelated variables. The x variables are 
uncorrelated with the comparison variables (let us call them the c vari- 
ables), and so eliminating the x variables does not eliminate any part 
of the treatment differences. The result is, in effect, an enhanced cor- 
relation between c and the remaining part of y. 


In most observational work, and in some experiments, x may be
correlated with the treatment effects. In other words, an analysis of 
variance or analysis of dispersion applied to the x variables could yield 
a significant result. (It is not always possible to apply analysis of vari- 
ance to a covariance variable, because a variable may be employed as 
a covariance variable even though it is not normally distributed.) The 
practical research worker and the theoretical statistician take different 
views of this case. The research worker sees it as an opportunity for 
introducing experimental controls in a non-experimental field, and the 
statistician sees it as a reason for exercising the utmost caution and, 
often, as a sufficient reason for refusing to carry out analysis of co- 
variance at all. The research worker may appreciate the statistician’s 
point of view if he reflects on the relations between regression analysis 
and the ordinary partial correlation coefficient. Partial correlation has 
been described as ‘a lazy man’s substitute for experimental control and 
fractionation of data’ (Guilford, 1956, p. 318). But the formula for
partial correlation is derived from the regression formula employed in 
both canonical analysis and analysis of covariance. If we wish to partial
out the effect of variable one on the correlation between variables two
and three, we may write the correlations as elements of a partitioned
matrix,

$$\left[\begin{array}{c|cc} 1 & r_{12} & r_{13} \\ \hline r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{array}\right]$$

The formula $R_{yy} - R_{yx}R_{xx}^{-1}R_{xy}$, applied to this matrix, yields

$$\begin{bmatrix} 1 - r_{21}r_{12} & r_{23} - r_{21}r_{13} \\ r_{32} - r_{31}r_{12} & 1 - r_{31}r_{13} \end{bmatrix}$$

This residual matrix contains the parts of the second and third variable
which are unrelated to, or cannot be accounted for by, the first. Since
$r_{21}r_{12} = r_{12}^2$ and $r_{31}r_{13} = r_{13}^2$, the correlation between the
residual parts of variables two and three may be written

$$r_{23.1} = \frac{r_{23} - r_{12}r_{13}}{\sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}}$$

which is the usual formula for the partial correlation coefficient.
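
The equivalence of the formula and the correlation of residuals can be verified numerically. The sketch below is an illustration only: it borrows the twelve boys of the arithmetic example, treats intelligence as variable one and reading as variable two, and represents the teaching method by a dummy variable scored −1 and +1 as variable three. The dummy coding and all the names are assumptions made for the purpose.

```python
import math

X = [4, 2, 3, 2, 3, 4, 2, 3, 2, 4, 3, 4]   # variable one: intelligence
Y = [4, 2, 0, 2, 0, 4, 6, 4, 6, 8, 4, 8]   # variable two: reading
C = [-1] * 6 + [1] * 6                     # variable three: dummy for method


def deviations(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def correlation(u, v):
    du, dv = deviations(u), deviations(v)
    cross = sum(a * b for a, b in zip(du, dv))
    return cross / math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))


def residual(v, x):
    """Part of v that cannot be accounted for by linear regression on x."""
    dv, dx = deviations(v), deviations(x)
    slope = sum(a * b for a, b in zip(dv, dx)) / sum(b * b for b in dx)
    return [a - slope * b for a, b in zip(dv, dx)]


r12, r13, r23 = correlation(X, Y), correlation(X, C), correlation(Y, C)
by_formula = (r23 - r12 * r13) / math.sqrt((1 - r12 ** 2) * (1 - r13 ** 2))
by_residuals = correlation(residual(Y, X), residual(C, X))
print(round(by_formula, 4), round(by_residuals, 4))
```

In these data the square of the partial correlation is 48/72, the ratio of the adjusted means sum of squares to the sum deviation sum of squares in the analysis of covariance table.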


Any research worker who has struggled to interpret a partial
correlation, or a set of partial correlations, must be aware of the problems
involved when both variable two and variable three are correlated with 
variable one. The analogy with analysis of covariance may be drawn 
by letting variable one represent the covariance variable x and variable 
two represent y. Then variable three stands for the comparison or 
treatment variable c. If the correlation $r_{13}$ ($r_{xc}$) is zero, the partial
correlation formula reduces to

$$r_{23.1} = \frac{r_{23}}{\sqrt{1 - r_{12}^2}} = \frac{r_{yc}}{\sqrt{1 - r_{xy}^2}}$$

which is quite simple to interpret. It is the ratio of the cosine between
c and y in their own space to the cosine between y and the projection of
y in a space at right-angles to x. Since c is at right-angles to x, its vector
is necessarily in the latter space. If x happens to lie in the plane of y and
c, then the projection of y in the orthogonal space is coincident with c
and the partial correlation is perfect.

The difficulty of interpreting a partial correlation when x is cor- 
related with c is associated with the difficulty of visualizing the simul- 
taneous projection of y and c on to a space at right-angles to x. 

We shall now look at an example of analysis of covariance in which 
there is a significant association between x and the treatment compari- 
son. The following data differ from those of the previous example only 
in that the boys learning by the old method are, on average, two points 
of intelligence below those learning by the new method. 


                 Old method                      New method
        Intelligence(x)  Reading(y)     Intelligence(x)  Reading(y)
              3              4                3              6
              1              2                4              4
              2              0                3              6
              1              2                5              8
              2              0                4              4
              3              4                5              8
Mean          2              2                4              6


The sums of squares and sums of products within groups are the same 
as before, and the analysis of covariance is* 


* The reader may care to calculate the analysis of variance for the adjusted 
scores y — bx as it was calculated for the first example. It will be found that 
the between groups sum of squares is 12, which is a gross over-estimate of the 
true value of 4-8. The analysis of variance of adjusted scores is sometimes 
employed as an approximation to an analysis of covariance, but its use is 
justified only if the differences among the groups on the x variables are small.

Source            d.f.   Σx²   Σxy   Σy²   (Σxy)²/Σx²   Deviation SS   d.f.    MS
Comparison          1    12    24    48
Error              10     8     8    32        8             24          9    2.67
Sum                11    20    32    80       51.2           28.8       10
Adjusted means                                                4.8        1    4.80

The variance ratio is $F_{1,9} = 4.80/2.67 = 1.80$ and is not significant.
Analysis of variance shows a significant relation between intelligence and
the group to which a boy is assigned ($F_{1,10} = 15.00$). The effect of
partialing out intelligence is to eliminate the superiority of the new
teaching method over the old. Strictly speaking, we cannot say that the new
method is not superior, but only that it has not been shown to be
superior. The adjusted mean for the new method is 5 and that for the
old method is 3.
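
The whole of this second analysis may be sketched in code. The illustration below uses invented names and assumes that the second set of data is obtained from the first by lowering every old method intelligence score by one point and raising every new method score by one point, which is the reading consistent with the sums of squares and products in the table. It also reproduces the over-estimate of 12 mentioned in the footnote.

```python
OLD = [(3, 4), (1, 2), (2, 0), (1, 2), (2, 0), (3, 4)]
NEW = [(3, 6), (4, 4), (3, 6), (5, 8), (4, 4), (5, 8)]


def sscp(pairs):
    """(sum x^2, sum xy, sum y^2) about the means of the pairs supplied."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    return (
        sum((x - mx) ** 2 for x, _ in pairs),
        sum((x - mx) * (y - my) for x, y in pairs),
        sum((y - my) ** 2 for _, y in pairs),
    )


def deviation_ss(sxx, sxy, syy):
    return syy - sxy ** 2 / sxx


error = tuple(a + b for a, b in zip(sscp(OLD), sscp(NEW)))   # within groups
total = sscp(OLD + NEW)                                      # sum row
b = error[1] / error[0]                                      # within group slope

ss_adjusted_means = deviation_ss(*total) - deviation_ss(*error)
variance_ratio = ss_adjusted_means / (deviation_ss(*error) / 9)


def mean(values):
    return sum(values) / len(values)


overall_x = mean([x for x, _ in OLD + NEW])
adjusted_means = [
    mean([y for _, y in g]) - b * (mean([x for x, _ in g]) - overall_x)
    for g in (OLD, NEW)
]

# Between groups SS of y - bx, the approximation criticised in the footnote.
adj_group_means = [mean([y - b * x for x, y in g]) for g in (OLD, NEW)]
adj_grand_mean = mean([y - b * x for x, y in OLD + NEW])
naive_between = sum(6 * (m - adj_grand_mean) ** 2 for m in adj_group_means)

print(error, total, round(ss_adjusted_means, 2), round(variance_ratio, 2))
print(adjusted_means, naive_between)
```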

The straightforward interpretation of this analysis exemplifies the 
sort of use to which the research worker wishes to put analysis of co- 
variance. But the example is susceptible of a simple and unambiguous 
interpretation only because we have a great deal of prior information 
about the field of study. We know that intelligence is a cause or ground 
of learning ability, and not vice versa. If the pupils in one sample learn 
better than those in another, and are also more intelligent, we are con- 
tent to treat the difference in intelligence as at least a partial explanation 
of the difference in learning. Furthermore, we are willing to apply our 
regression coefficient, estimated from the within group row of the
analysis, to the whole range of intelligence included in the experiment, 
because the two groups do not diverge too widely in their average 
intelligence. If one group had been undergraduates and the other defec- 
tives, we should not have pretended to extrapolate from the regression 
within groups to the terra incognita lying between them. Again, we
know that it is in principle possible to remove the association between 
intelligence and the comparison by allotting boys to groups at random. 
In arguing from our correlated case to an uncorrelated case, we are not 
pretending to construct by statistics a situation which could never, in 
fact, exist. 

In many analyses, particularly those derived from observational 
material, some or all of these desiderata may be lacking. Would it be 
sensible to partial out age in a study of acute hebephrenic and acute 
paranoid schizophrenics? The average hebephrenic in the acute stage 
of his illness is, let us suppose, about 20 years of age, whereas the 
average paranoid is about 35. The conclusions of an analysis of
covariance would purport to refer to a comparison between hebephrenia
at age 28 and paranoid schizophrenia at the same age, and would be
atypical even if they were not false.

The analysis of covariance seems, at first sight, to offer a statistical 
substitute for experimental control of variables. But the substitute is 
adequate only when persons are assigned to treatment groups at random 
and the covariance variables are such that they cannot be affected by 
treatment. In so far as covariance adjustment establishes a condition of 
orthogonality which does not exist in the real world, the conclusions are 
open to the difficulties of interpretation which beset any contrafactual 
conditional statement. 

Even when differences between groups on the x variables are not 
significant, a significant conclusion to the analysis of covariance cannot 
be held to convey definite information. Only if allotment of persons to 
groups has been truly random are we in a position to state that the 
design has ruled out the possibility of systematic bias by extraneous 
variables. If assignment of persons to groups is non-random, or if the 
groups are naturally-occurring or self-selected, then it is always possible 
to argue that the significant association between y and treatments is a 
consequence of a characteristic which could have been included as an 
x variable but which was not. Analysis of covariance is unambiguous 
when it partials out, and can only partial out, aspects of y. Interpretation 
becomes much more difficult if (a) it may be partialing out part of the 
treatment variable c and (b) absence of randomization has left a way
open for any number of extraneous variables to bring their influence to
bear on the figures.


Assumptions 


A problem which may arise in applying analysis of covariance is curva- 
ture of regression of y on x. It may be, for example, that x is linearly 
related to the logarithm of y rather than to y itself. Since analysis of 
covariance assumes linearity of regression, the obvious solution is to 
substitute log (y) for y in the analysis. It is sometimes useful to define
several covariance variables by introducing different powers of a single
variable, x, x², x³, etc.

Analysis of covariance makes the usual assumptions of normality and 
homogeneity of variance of the y variables. The x variables need not be 
normally distributed, but for every value of the x variable y must be 
normally distributed with the same variance in all groups. The regres- 
sions of y on x in the various groups should all be estimates of a 
common within group regression. 


Canonical analysis of scores after covariance adjustment 


The example at the beginning of the chapter showed how the scores of
individuals are adjusted for covariance. It may sometimes be useful to
perform a covariance adjustment before calculating a discriminant
function or canonical analysis of discriminance. It will be recalled
that the formula for a two-group discriminant function is w = V⁻¹d and
the squared distance between the groups is D² = d′V⁻¹d = d′w, where d
is a vector of differences between the group means and V is the within
group sums of squares and sums of products matrix divided by
Σ(nᵢ − 1). If a covariance adjustment has been performed, d is the
vector of differences between the group means on the adjusted y
variables and V is W_y.x divided by Σ(nᵢ − κ − 1), where κ is the number
of covariance variables. The equations then hold for adjusted variables
exactly as for unadjusted variables. Discriminant function analysis with
adjustment for covariance is most easily conceived as a two-stage
process. The first stage is the covariance adjustment of each individual
score, y − bx. Here y and x may be single values or vectors, and b may
be a single coefficient or a matrix of coefficients. The adjusted scores
are then subjected to a discriminant function analysis with the minor
modification of the within group degrees of freedom which has been
indicated.
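
The two-stage process may be sketched as follows. The scores are hypothetical, invented only for illustration, with one covariance variable x and two y variables; the divisor for V follows the expression printed in the text, and the helper names are assumptions of this sketch.

```python
def mean(values):
    return sum(values) / len(values)


def within_ssp(groups, i, j):
    """Pooled within-group sum of products between columns i and j."""
    total = 0.0
    for rows in groups:
        mi = mean([r[i] for r in rows])
        mj = mean([r[j] for r in rows])
        total += sum((r[i] - mi) * (r[j] - mj) for r in rows)
    return total


# Hypothetical scores: each row is (x, y1, y2); two groups of five persons.
group_a = [(1, 2, 5), (2, 3, 4), (3, 5, 6), (4, 4, 7), (5, 6, 8)]
group_b = [(3, 6, 4), (4, 7, 6), (5, 9, 5), (6, 8, 7), (7, 10, 8)]
groups = [group_a, group_b]
n_covariates = 1

# Stage 1: common within-group regression of each y on x, then y - b*x.
w_xx = within_ssp(groups, 0, 0)
b = [within_ssp(groups, 0, j) / w_xx for j in (1, 2)]
adjusted = [[(r[1] - b[0] * r[0], r[2] - b[1] * r[0]) for r in rows]
            for rows in groups]

# Stage 2: discriminant function on the adjusted scores.
# Divisor taken as the sum over groups of (n_i - kappa - 1), as in the text.
df_within = sum(len(rows) - n_covariates - 1 for rows in groups)
v = [[within_ssp(adjusted, i, j) / df_within for j in (0, 1)] for i in (0, 1)]
d = [mean([r[i] for r in adjusted[0]]) - mean([r[i] for r in adjusted[1]])
     for i in (0, 1)]

# w = V^-1 d for a 2 x 2 matrix, and D^2 = d'w.
det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
v_inv = [[v[1][1] / det, -v[0][1] / det], [-v[1][0] / det, v[0][0] / det]]
w = [v_inv[i][0] * d[0] + v_inv[i][1] * d[1] for i in (0, 1)]
d_squared = d[0] * w[0] + d[1] * w[1]

print(b, d, w, d_squared)
```

A different choice of divisor rescales w and D² by a constant but leaves the direction of the discriminant function unchanged.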

The extension to multiple discriminant analysis and canonical analysis
of discriminance is obvious. In calculating V⁻¹m for a group, m is the
vector of adjusted means of the group on the y variables. When a
canonical variate is standardized so that x′Vx = 1, the value of the sum
of squares within groups becomes Σ(nᵢ − κ − 1), κ being the number of
covariance variables.

There are perhaps hardly enough published examples of canonical
analysis with covariance adjustment to enable any assessment to be
made of the value of the method. Some statisticians doubt whether it
can be superior to a straightforward canonical analysis which employs
both the x and y variables and makes no logical distinction between
them. Furthermore, as with analysis of covariance, interpretation will
be difficult unless the x variables are independent of the treatments. On
the other hand, the typical social or psychological variable is a
contaminated measure, and the conclusions of the analysis are likely to
stand out more clearly if the contamination can be removed before the
analysis begins. It is evident, however, that the research worker must
have a sound idea of what it is that the various parts of his variables
measure, and hence of which characteristics are eliminated and which
are retained in the analysis. When a research worker adjusts or ‘corrects’
for the x variables, he must not suppose that the means of the y − bx
scores are necessarily more correct, real or objective than were the
means of y.


Chapter 10 The Relative Importance of Variables


The relative importance of variables in a regression equation is usually
determined by applying one or other of a family of methods. There are
five main methods in this family, namely:

(1) The calculation of R² first for predictor variable one, then for
predictor variables one and two, then for predictor variables one,
two and three, and so on. The analysis terminates when the increment
in R² brought about by adding a new variable fails to satisfy
a certain criterion.

(2) The calculation of R² for all t variables, for all variables less
variable t, for all variables less variables t and t − 1, and so on.
The analysis terminates when the decrement in R² brought about
by omitting a variable exceeds a certain criterion.

(3) The calculation of R² for the variable which has the highest
individual correlation r with the criterion (R² = r²) and the
addition of the variable which produces the greatest increase in
R². The analysis terminates when the increment in R² fails to
satisfy a certain criterion.

(4) The calculation of the k values of R² for k arbitrarily chosen,
and perhaps overlapping, subsets of variables, and the choice of
that subset which yields the largest R².

(5) The calculation of R² first for all variables less variable one, then
for all variables less variable two, then for all variables less
variable three, and so on. A decision is made to neglect a variable if
its omission does not cause R² to drop significantly below the
value of R² for all the variables.

The variance ratio for testing whether a regression based on p + q
variables is significantly better than a regression based on p variables is
given by Rao (1952, p. 252) in terms of Mahalanobis’ D².

Although this family of methods tends to be favoured by statisticians,
it raises some statistical difficulties. In particular it maximizes the danger
of capitalizing on chance errors in the data, and it raises difficulties
concerning the sequential employment of tests of the significance of the
difference between two values of R². In addition, methods (1) and (2)
demand that the research worker prejudges the issue by determining the
order in which variables enter the analysis. The methods were devised
to deal with the simple problem of predicting a single, given criterion
variable. They would be difficult to apply to the two sides of a canonical
correlation analysis.
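
Method (3) of the list above may be sketched as follows, working from a matrix of correlations rather than from raw scores. The correlations and the stopping criterion of 0.01 are hypothetical, and the criterion is applied to the raw increment in R² rather than through a significance test; R² for a subset is obtained as r′A⁻¹r on the corresponding sub-matrix.

```python
def solve(a, y):
    """Solve the linear system a * beta = y by Gaussian elimination."""
    n = len(y)
    m = [list(a[i]) + [y[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    beta = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(m[row][k] * beta[k] for k in range(row + 1, n))
        beta[row] = (m[row][n] - tail) / m[row][row]
    return beta


def r_squared(subset, corr, crit):
    """R^2 for the predictors listed in `subset`."""
    a = [[corr[i][j] for j in subset] for i in subset]
    r = [crit[i] for i in subset]
    return sum(b * c for b, c in zip(solve(a, r), r))


def forward_selection(corr, crit, min_increment):
    """Add, at each step, the variable giving the greatest increase in R^2."""
    selected, path, current = [], [], 0.0
    remaining = list(range(len(crit)))
    while remaining:
        best = max(remaining,
                   key=lambda v: r_squared(selected + [v], corr, crit))
        candidate = r_squared(selected + [best], corr, crit)
        if candidate - current < min_increment:
            break
        selected.append(best)
        path.append(candidate)
        remaining.remove(best)
        current = candidate
    return selected, path


# Hypothetical correlations among three predictors and with the criterion.
corr = [[1.0, 0.5, 0.3],
        [0.5, 1.0, 0.2],
        [0.3, 0.2, 1.0]]
crit = [0.6, 0.3, 0.4]

selected, path = forward_selection(corr, crit, min_increment=0.01)
print(selected, [round(v, 4) for v in path])
```

The output is a subset of variables and a sequence of R² values; it assigns no measure of importance to the variables that are retained.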

The research worker’s basic objection is that these methods are not 
designed to answer his problem. They do not, in fact, yield estimates of 
the relative importance of variables. They simply tell us that variables 
in a certain subset shall have non-zero weights, and that all other vari- 
ables shall have zero weights in a prediction equation. Indeed, it is quite 
possible for no variable to meet the criterion implied by method (5). 
Furthermore, if the matrix of predictor intercorrelations has one or 
more near-zero roots, there is no stable solution to the choice of a 
subset of variables. 

The statistician may quite truly reply that the research worker’s 
problem is unanswerable. It is the inverse of the economic and social 
problem of trying to estimate the relative rewards due to different groups 
of workers (productive units) in the division of the national product. In 
an economy which takes significant advantage of the benefits of speciali- 
zation and division of labour, the whole product is more than the sum 
of the inputs of productive units. Similarly, though inversely, the vari- 
ables in a regression equation, in so far as their predictive variance is 
correlated (that is, in so far as there are several variables inefficiently 
doing a job which could be done by only one variable), are competing
to do the same job, and the result of their combined efforts is less than
the sum of their inputs.


In meeting the problem of distributing the national income among 
productive units, the social administrator faces two alternatives. He can 
allow the productive units to fight it out among themselves, or he can 
impose some sort of incomes policy. In practice almost all western 
European and American administrations have chosen, or are moving 
towards, an incomes policy, and the reasoning behind this trend is 
relevant to our problem. The trouble with choosing competition is not 
only that competition is disruptive of production, but also that the 
choice can never result in a stable solution since, in a complex economy, 
almost every productive group is undervalued, if its price is assessed at 
the loss which ensues when it withdraws its labour. Thus deletion of a 
productive unit over-estimates the weight which should be given to that 
unit, just as, inversely, deletion of a variable underestimates the weight 
that should be given to that variable. The administrator’s solution is to 
impose a compromise, in which an attempt is made to scale down the 
claims of each unit so that the total claimed does not exceed the amount 
available for distribution. The result cannot claim any sort of scientific 
validity; nevertheless, it is a workable solution of the problem.

The practical problem of choosing an efficient set of predictors is 
quite different from the theoretical problem of determining which vari- 
ables are important in a particular field. Even when all the variables 
are measured with a uniform and high degree of validity, the latter
problem is insoluble by computational methods alone. The question of
the relative importance of variables in a regression equation is an ill- 
defined question. The calculation of coefficients of separate determina- 
tion imposes a definition which cannot be the best for all purposes, but 
which may often be a workable definition. Coefficients of separate 
determination may be calculated for any regression equation, whether 
it results from a multiple regression analysis, a canonical correlation 
analysis, a two-group discriminant function analysis, or a canonical 
analysis of discriminance. The regression weights should be set out as 
the diagonal elements of a diagonal matrix B, that is, a matrix which 
has zeros everywhere except in the principal diagonal. The coefficients 
of separate determination for the regression equation are calculated as

BAB1

where 1 is a column vector of unities, and the matrix A varies according
to the nature of the analysis. In an ordinary multiple regression analysis 
in which there is one criterion variable, the matrix A is the matrix of 
correlations among the predictor variables. In a canonical correlation 
analysis the formula may be applied to both sets of regression weights 
associated with each canonical root, and each diagonalized regression 
equation is multiplied by its appropriate matrix of correlations. In a
canonical analysis of discriminance A is the total dispersion matrix. The 
sum of squares of the regression weights, and the absolute size of the 
elements of A, are immaterial to the calculations. For example, the total 
dispersion matrix may be replaced by the total sums of squares and sums 
of products matrix. To facilitate the comparison of regression equations, 
the sum of the coefficients of separate determination may be set equal 
to R², D² or the canonical root. Alternatively, each coefficient may be
expressed as a percentage or proportion of the sum of all the coefficients. 
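
As a minimal sketch of the calculation BAB1, the following uses two hypothetical predictors whose regression weights are obtained from a 2 × 2 correlation matrix; the figures and names are illustrative only. It also checks that rescaling the weights and the matrix A leaves the percentage shares unchanged.

```python
def separate_determination(weights, a):
    """Row sums of B A B with B = diag(weights): h_i = b_i * sum_j a_ij * b_j."""
    n = len(weights)
    return [weights[i] * sum(a[i][j] * weights[j] for j in range(n))
            for i in range(n)]


# Hypothetical correlations among two predictors (a) and with the criterion (r).
a = [[1.0, 0.5],
     [0.5, 1.0]]
r = [0.6, 0.5]

# Regression weights b = A^-1 r, written out for the 2 x 2 case.
det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
b = [(a[1][1] * r[0] - a[0][1] * r[1]) / det,
     (a[0][0] * r[1] - a[1][0] * r[0]) / det]

h = separate_determination(b, a)
r2 = sum(h)                                  # the coefficients sum to R^2
shares = [100 * value / r2 for value in h]   # percentage of the total

# Rescaling the weights and the matrix does not alter the shares.
h_scaled = separate_determination([3 * x for x in b],
                                  [[10 * x for x in row] for row in a])
shares_scaled = [100 * value / sum(h_scaled) for value in h_scaled]

print([round(x, 4) for x in h], round(r2, 4), [round(x, 1) for x in shares])
```

Because Ab = r in the multiple regression case, each coefficient also equals the product of a variable's regression weight and its correlation with the criterion.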

The matrix formula given above is a general formula which includes 
as special cases the simple methods of calculation appropriate to a 
multiple regression analysis (chapter 6) or a two-group discriminant 
function analysis (chapter 7). When the formula is applied to the case 
of two predictor variables and a single criterion variable, the two
coefficients of separate determination, hᵢ, are given by

h₁ = b₁² + r₁₂b₁b₂
h₂ = r₂₁b₁b₂ + b₂²

where r₁₂ = r₂₁ is the correlation between the two predictor variables
and b₁ and b₂ are their respective regression weights. bᵢ² represents the
isolated influence of a variable in the determination of R². The conjoint
influence of two variables, which is represented by rᵢⱼbᵢbⱼ, is (arbitrarily)
split equally between the two variables concerned. The value of the
conjoint influence may be positive or negative, according to the signs 
of its three elements. A negative conjoint term, or a series of such terms, 
may outweigh the squared regression coefficient, which is necessarily 
positive, and so result in a negative coefficient of separate determination. 
A negative coefficient does not imply that the variable in question is 
detracting from the value of R². No variable in a regression equation
can lower the effectiveness of the prediction within the sample from 
which the equation was derived. 
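
A small hypothetical case shows how a negative coefficient can arise while R² is nevertheless raised by the variable concerned. The correlations below are invented for the purpose, and the names are illustrative.

```python
# Hypothetical correlations: two predictors correlated 0.8 with each other,
# and 0.6 and 0.3 respectively with the criterion.
r12 = 0.8
r = [0.6, 0.3]

# Regression weights b = A^-1 r for the 2 x 2 correlation matrix.
det = 1.0 - r12 ** 2
b = [(r[0] - r12 * r[1]) / det,
     (r[1] - r12 * r[0]) / det]

# Isolated influence of each variable and the conjoint term each receives.
isolated = [b[0] ** 2, b[1] ** 2]
conjoint = r12 * b[0] * b[1]
h = [isolated[0] + conjoint, isolated[1] + conjoint]

r2_both = sum(h)                 # R^2 with both predictors
r2_first_alone = r[0] ** 2       # R^2 with the first predictor only

print([round(x, 4) for x in b])
print([round(x, 4) for x in h], round(r2_both, 4), round(r2_first_alone, 4))
```

In this case the second variable has a negative coefficient of separate determination, yet R² for the pair exceeds R² for the first variable alone.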

Negative coefficients of separate determination cannot occur if the 
vector of correlations between the criterion variable and the predictor 
variables is equal (apart from a constant multiplier) to a latent vector 
of the matrix A. This is true whatever the size of the latent root
associated with the component, and it follows from the fact that

A⁻¹f = λ⁻¹f

where f is a latent vector of a non-singular symmetric matrix A and λ
is the associated latent root. This equation implies that any latent
vector of A is also a latent vector of the inverse of A, and that the
two associated latent roots are reciprocal to one another. In
calculating regression weights, we multiply a vector of correlations by
A⁻¹. If the vector of correlations is a latent vector of A⁻¹ and
therefore of A, the resulting vector of regression weights will also be
a latent vector of A. In calculating coefficients of separate
determination we pre- and postmultiply A by a diagonal matrix whose
diagonal elements are the elements of the vector of regression weights.
If that vector is a latent vector of A, the resulting coefficients of
separate determination will be the squares of the elements of the
latent vector of A (apart from a constant multiplier).
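
The argument can be checked numerically. The sketch below assumes a hypothetical matrix A in which all three predictors intercorrelate 0.4, for which (2, −1, −1) is a latent vector with latent root 0.6; the criterion correlations are taken proportional to that vector.

```python
def matvec(m, v):
    """Product of a square matrix and a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


# Hypothetical predictor correlation matrix with equal intercorrelations.
rho = 0.4
a = [[1.0 if i == j else rho for j in range(3)] for i in range(3)]

# (2, -1, -1) is a latent vector of this matrix with latent root 1 - rho.
f = [2.0, -1.0, -1.0]
latent_root = 1.0 - rho
a_times_f = matvec(a, f)

# Criterion correlations proportional to the latent vector.
r = [0.2 * x for x in f]

# Since A^-1 f = f / lambda, the regression weights are simply r / lambda.
b = [x / latent_root for x in r]
a_times_b = matvec(a, b)          # should reproduce r

# Coefficients of separate determination: h_i = b_i * (A b)_i.
h = [b[i] * a_times_b[i] for i in range(3)]
ratios = [h[i] / f[i] ** 2 for i in range(3)]   # constant if h_i is prop. to f_i^2

print([round(x, 4) for x in h], [round(x, 4) for x in ratios])
```

Two of the regression weights are negative, but every coefficient is a multiple of a squared element of the latent vector and so none is negative.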
We have been discussing the relation between a criterion variable and 
a number of predictor variables, and we have seen that the relation is 
particularly simple when the criterion variable projects into the pre- 
dictor space along the line of one of the principal axes of that space. 
The degree of the association, that is, the value of R or R², is irrelevant
to the relationship. The cosine between the criterion vector and the 
regression line in the predictor space may be quite small; but so long 
as it is not zero, and so long as the criterion variable is associated with 
one and only one component of the predictor variables, then the cal- 
culation of coefficients of separate determination is simply a matter of 
squaring the elements of the relevant latent vector, and none of these 
squared values can be negative. 
Let us now look at a different, but related, problem, that of relating 
a criterion variable to the components, rather than to the variables, of 
the predictor space. The solution of the problem is of practical
importance (chapter 6) because it reveals how far any particular analysis
conforms to the condition that all the predictive power of a battery is
due to a single component. The problem, however, also has a theoretical 
interest in that it further clarifies the nature of the coefficient of separate 
determination. 

Let us suppose that we have calculated the score of each person on
each component of the predictor matrix. The ‘variance-covariance’
matrix of these scores is the diagonal matrix of latent roots Λ. Because
Λ is a diagonal matrix, its inverse is particularly easy to calculate. Each
diagonal element of the inverse is simply the reciprocal of the
corresponding diagonal element of Λ, and all the other elements are zero.
Λ⁻¹ is multiplied by the vector of covariances between the components
and the criterion in order to arrive at the regression weight of each
component. These weights are entered in the principal diagonal of the
diagonal matrix B. The coefficients of separate determination for the
components are calculated as

BΛB1

Because B and Λ are both diagonal matrices, the product BΛB is also
a diagonal matrix. In this product the ith element of the principal
diagonal is λᵢbᵢ², which must be non-negative. In consequence, the
coefficient of separate determination for each component is necessarily
non-negative. The sum of the coefficients of separate determination is
equal to R².

The simple relation between R² and the coefficients of separate
determination for the components furnishes a solution to a problem
which sometimes crops up. A large predictor correlation matrix may
prove to be singular, so that no inverse can be found for it. Scores on
components may then be substituted for scores on the predictor
variables. If a large proportion of the value of R² is attributable to a
certain subset of components, the remaining components may be discarded.
The regression weight of a variable may be calculated by multiplying
its weight in a normalized component by the regression weight for that
component, and summing over all the retained components.
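
The calculation for components may be sketched with two hypothetical predictors correlated 0.5, whose normalized latent vectors are known in closed form. The matrix here is not singular; it is chosen only so that the arithmetic can be followed by hand, and all names are illustrative.

```python
import math

# Hypothetical correlations: between the two predictors (rho) and between
# each predictor and the criterion (r).
rho = 0.5
r = [0.6, 0.5]

# Normalized latent vectors and latent roots of [[1, rho], [rho, 1]].
s = 1.0 / math.sqrt(2.0)
components = [([s, s], 1.0 + rho),
              ([s, -s], 1.0 - rho)]

# Covariance of each component with the criterion, and its regression weight.
cov = [sum(f[j] * r[j] for j in range(2)) for f, _ in components]
comp_weight = [cov[i] / components[i][1] for i in range(2)]

# Coefficient of separate determination for component i: lambda_i * b_i^2.
h = [components[i][1] * comp_weight[i] ** 2 for i in range(2)]
r2 = sum(h)


def variable_weights(retained):
    """Sum over retained components of (weight in component) * (component weight)."""
    return [sum(components[i][0][j] * comp_weight[i] for i in retained)
            for j in range(2)]


full = variable_weights([0, 1])      # all components retained
reduced = variable_weights([0])      # second component discarded

print([round(x, 4) for x in h], round(r2, 4))
print([round(x, 4) for x in full], [round(x, 4) for x in reduced])
```

With both components retained, the variable weights agree with those obtained by inverting the correlation matrix directly; with only the first retained, the two variables receive equal weights and R² falls by the small coefficient of the discarded component.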

Coefficients of separate determination cannot be accepted at their 
face value unless they refer to uncorrelated, perfectly valid, measures. 
It is sometimes advisable to look at the values of all the terms which, 
when summed, give the value of an individual coefficient. In this way 
the structure of each coefficient may be studied. A sufficient, but not a 
necessary, condition for a zero coefficient is a zero regression weight. A 
predictor variable which is not correlated with any of the other
predictors cannot have a negative coefficient.